Problem
engine.pid (written by runs.write_pid) stores only the bare numeric PID. Operating systems recycle PIDs, so after an engine crashes/is kill -9'd without the run being marked finished (the stale engine.pid is never deleted), the OS can reassign that PID to an unrelated process. Then:
- runs.engine_alive() (pid-only; no tmux fallback when a pid is present) can return a false positive, blocking resume/delete/archive with "still running".
- runs.stop_run() → platform_util.terminate_pid(pid) can send SIGTERM to the recycled, unrelated process.
Blast radius is bounded (SIGTERM not SIGKILL; os.kill only succeeds on a same-UID process), and the conjunction (hard crash + PID wraparound + action on that same stale run) is low-probability, but the wrong-process termination is a real correctness hazard.
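To illustrate why a pid-only check cannot tell the engine apart from a recycled PID, here is a minimal POSIX-only sketch of a signal-0 probe. The helper name pid_exists is hypothetical and this is not the project's engine_alive code; it only shows the general shape of a check that answers "does some process have this PID?" and nothing more.

```python
import os


def pid_exists(pid: int) -> bool:
    """Hypothetical pid-only liveness probe (POSIX only).

    Signal 0 performs the existence and permission checks without
    delivering a signal. It says nothing about WHICH process owns the PID.
    """
    if os.name != "posix":
        # On Windows, os.kill with an arbitrary signal terminates the
        # target, so this probe must not be used there.
        raise NotImplementedError("signal-0 probe is POSIX-only")
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process currently has this PID
    except PermissionError:
        return True  # a process exists but belongs to another user
    return True
```

A recycled PID makes this return True for a process that was never the engine, which is the false positive described above. The PermissionError branch also reflects the same-UID bound: a real signal to that process would fail.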
Surfaced by CodeRabbit on #11 (comment 3472621208). Pre-existing pattern — not introduced by that PR. The pid <= 0 guard from that review was addressed in 6492cdf; this issue tracks the remaining identity-verification work.
Proposed fix
Persist a per-process identity token alongside the PID and verify it before trusting liveness or terminating:
- platform_util.pid_identity(pid) -> str | None returns the process start time. Linux: field 22 of /proc/<pid>/stat (a bare /proc read, # portability:-acked and added to the portability-guard allowlist). macOS/Windows: psutil.Process(pid).create_time() (the existing optional non-linux extra). This mirrors the existing unity_teardown proc/psutil split.
- write_pid writes "<pid> <start_time>"; read_pid returns (pid, token). This ripples to engine_alive, stop_run, and tui/data.py, which reads engine.pid directly.
- Verify pid_identity(pid) == stored_token in engine_alive / stop_run / data.liveness; a mismatch means the PID was recycled, so treat the process as not ours.
- Legacy tolerance: existing engine.pid files are bare ints, so an absent token must degrade to today's behavior, not crash.
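The pieces above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: parse_stat_starttime, format_pid_file, parse_pid_file, and identity_matches are hypothetical helper names, and treating an unavailable current identity as "cannot verify, fall back to pid-only" is an assumed policy that the issue does not specify.

```python
from __future__ import annotations

import sys
from pathlib import Path


def parse_stat_starttime(stat_line: str) -> str:
    """Return field 22 (starttime) from the text of /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')',
    # so cut at the LAST ')' before splitting on whitespace.
    after_comm = stat_line[stat_line.rindex(")") + 1:].split()
    # after_comm[0] is field 3 (state), so field 22 sits at index 19.
    return after_comm[19]


def pid_identity(pid: int) -> str | None:
    """Start-time token for pid, or None if it cannot be determined."""
    if sys.platform.startswith("linux"):
        try:
            return parse_stat_starttime(Path(f"/proc/{pid}/stat").read_text())
        except (OSError, ValueError, IndexError):
            return None
    try:
        import psutil  # optional extra on non-Linux platforms
    except ImportError:
        return None
    try:
        return str(psutil.Process(pid).create_time())
    except (psutil.Error, ValueError):
        return None


def format_pid_file(pid: int, token: str | None) -> str:
    """Serialise as "<pid> <start_time>", or a bare pid when no token exists."""
    return f"{pid} {token}" if token else str(pid)


def parse_pid_file(text: str) -> tuple[int, str | None] | None:
    """Parse both the new format and legacy bare-int files."""
    parts = text.split()
    if not parts:
        return None
    try:
        pid = int(parts[0])
    except ValueError:
        return None
    if pid <= 0:
        return None
    token = parts[1] if len(parts) > 1 else None  # legacy file: no token
    return pid, token


def identity_matches(stored: str | None, current: str | None) -> bool:
    """True when the process may be treated as ours.

    Legacy files (stored is None) degrade to pid-only behavior.
    ASSUMPTION: an unavailable current identity also falls back to pid-only.
    """
    if stored is None or current is None:
        return True
    return stored == current
```

Under this sketch, engine_alive and stop_run would act on a PID only when both the existing pid-only check and identity_matches(token, pid_identity(pid)) hold. On Linux the token counts clock ticks since boot, so it is only meaningful within a single boot; whether that matters for stale runs that survive a reboot is left open here.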
Affected files
src/automator/platform_util.py, src/automator/runs.py, src/automator/tui/data.py, tests/test_runs.py, tests/test_tui_data.py, tests/test_portability_guard.py (+ new pid_identity tests).