Skip to content

refactor+reliability: extract atomic helpers + lock-protect decide.py#17

Open
CryptoJones wants to merge 1 commit into
mainfrom
refactor/shared-atomic-write-and-decide-lock
Open

refactor+reliability: extract atomic helpers + lock-protect decide.py#17
CryptoJones wants to merge 1 commit into
mainfrom
refactor/shared-atomic-write-and-decide-lock

Conversation

@CryptoJones
Copy link
Copy Markdown
Owner

Pulls the read/write atomicity pattern into one place and uses it to
fix two new failure modes in socrates decide.

New module _atomic.py:

  • atomic_write_text(path, text) — tempfile + os.replace. Same
    pattern as the in-flight interview-atomic-save and patterns-cache-
    atomic-save fixes; centralized here so a third caller (decide) and
    any future ones don't reinvent it.
  • locked_read_modify_write(path, mutate) — exclusive read-modify-
    write guarded by a POSIX file lock. Holds the lock through the
    entire read → mutate → atomic-write cycle, so concurrent callers
    serialize.

Two subtleties worth flagging for review:

  1. The lock is taken on a SIBLING .<name>.lock file, NOT on path
    itself. We initially tried locking path, but atomic_write_text
    renames a tempfile onto it — which orphans the old inode and the
    flock with it. A second worker would then acquire the (stale)
    old-inode lock and read pre-rename content, producing a silent
    lost update. The sibling lock file's inode is stable across
    renames, so serialization is preserved. There's a regression test
    that proves this (timing assertion: two 300ms holds must take
    ≥500ms in serial).

  2. On non-POSIX (no fcntl) the lock silently no-ops, matching the
    pre-fix unlocked behavior. The atomic-write half still applies,
    so a mid-write SIGINT doesn't corrupt the file. No regression on
    Windows; just no new concurrency protection.

socrates decide now uses locked_read_modify_write. This fixes:

  • Data loss under concurrent writes. Pre-fix: two socrates decide
    invocations both read the same pre-state, both compute new content,
    and the second writer silently clobbers the first. An operator
    scripting batch decisions (or a CI hook) loses decisions with no
    error to detect them.
  • Corruption under SIGINT. Pre-fix: Ctrl-C during the
    decisions_path.write_text(new_body) call leaves DECISIONS.md
    truncated — socrates timeline then crashes parsing the file and
    the project audit reports it broken.

Also folds in the multi-line whitespace collapse from the
bugfix/decide-multiline-collapse branch so this PR is self-contained
(" ".join(text.split())) — a multi-line socrates decide $'foo\\nbar'
no longer produces a broken bullet whose closing ** lands on a
different line.

Tests (new tests/test_atomic.py — 8 tests + 2 in test_decide.py):

atomic_write_text:

  • writes full content with UTF-8 by default (em-dashes, smart
    quotes, 漢字 round-trip)
  • overwrites existing files
  • leaves no .tmp on success
  • mid-rename failure: pre-existing data untouched, .tmp cleaned

locked_read_modify_write:

  • single-process: mutate sees current text, write lands
  • non-existent file raises FileNotFoundError up-front
  • TWO subprocesses each holding the lock for 300ms must both land
    their updates (no lost-update) AND total elapsed time is ≥500ms
    (proving lock-blocking, not just luck)

decide integration:

  • two concurrent record_decision calls keep BOTH decisions
    (subprocess test; skipped on Windows)

156 tests pass total (was 149); ruff + mypy strict clean.

Future work: interview.py and patterns.py have their own copies of
the atomic-write logic from in-flight PRs. Once those merge, a
follow-on can dedupe both call sites onto _atomic.atomic_write_text.

Merge-order note (important): this PR overlaps with two in-flight ones:

Cleanest sequence:

  1. Merge this PR first to introduce socrates120x/_atomic.py.
  2. Then rebase reliability(interview): atomic save + graceful recovery from corrupt resume #5 and reliability(patterns): atomic save of usage cache #14 to use from socrates120x._atomic import atomic_write_text and drop their private copies.

Functionally each PR is correct on its own; the order only matters for diff cleanliness.

Also folds in the multi-line whitespace collapse from #4 (bugfix/decide-multiline-collapse) so this PR is self-contained on the decide.py side. If you merge this first, #4 becomes a no-op.

Pulls the read/write atomicity pattern into one place and uses it to
fix two new failure modes in `socrates decide`.

New module `_atomic.py`:

- `atomic_write_text(path, text)` — tempfile + os.replace. Same
  pattern as the in-flight interview-atomic-save and patterns-cache-
  atomic-save fixes; centralized here so a third caller (decide) and
  any future ones don't reinvent it.
- `locked_read_modify_write(path, mutate)` — exclusive read-modify-
  write guarded by a POSIX file lock. Holds the lock through the
  entire read → mutate → atomic-write cycle, so concurrent callers
  serialize.

Two subtleties worth flagging for review:

1. The lock is taken on a SIBLING `.<name>.lock` file, NOT on `path`
   itself. We initially tried locking `path`, but `atomic_write_text`
   renames a tempfile onto it — which orphans the old inode and the
   flock with it. A second worker would then acquire the (stale)
   old-inode lock and read pre-rename content, producing a silent
   lost update. The sibling lock file's inode is stable across
   renames, so serialization is preserved. There's a regression test
   that proves this (timing assertion: two 300ms holds must take
   ≥500ms in serial).

2. On non-POSIX (no `fcntl`) the lock silently no-ops, matching the
   pre-fix unlocked behavior. The atomic-write half still applies,
   so a mid-write SIGINT doesn't corrupt the file. No regression on
   Windows; just no new concurrency protection.

`socrates decide` now uses `locked_read_modify_write`. This fixes:

- Data loss under concurrent writes. Pre-fix: two `socrates decide`
  invocations both read the same pre-state, both compute new content,
  and the second writer silently clobbers the first. An operator
  scripting batch decisions (or a CI hook) loses decisions with no
  error to detect them.
- Corruption under SIGINT. Pre-fix: Ctrl-C during the
  `decisions_path.write_text(new_body)` call leaves DECISIONS.md
  truncated — `socrates timeline` then crashes parsing the file and
  the project audit reports it broken.

Also folds in the multi-line whitespace collapse from the
bugfix/decide-multiline-collapse branch so this PR is self-contained
(`" ".join(text.split())`) — a multi-line `socrates decide $'foo\\nbar'`
no longer produces a broken bullet whose closing `**` lands on a
different line.

Tests (new tests/test_atomic.py — 8 tests + 2 in test_decide.py):

`atomic_write_text`:
- writes full content with UTF-8 by default (em-dashes, smart
  quotes, 漢字 round-trip)
- overwrites existing files
- leaves no .tmp on success
- mid-rename failure: pre-existing data untouched, .tmp cleaned

`locked_read_modify_write`:
- single-process: mutate sees current text, write lands
- non-existent file raises FileNotFoundError up-front
- TWO subprocesses each holding the lock for 300ms must both land
  their updates (no lost-update) AND total elapsed time is ≥500ms
  (proving lock-blocking, not just luck)

`decide` integration:
- two concurrent `record_decision` calls keep BOTH decisions
  (subprocess test; skipped on Windows)

156 tests pass total (was 149); ruff + mypy strict clean.

Future work: interview.py and patterns.py have their own copies of
the atomic-write logic from in-flight PRs. Once those merge, a
follow-on can dedupe both call sites onto `_atomic.atomic_write_text`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant