fix(plan): stateless re-plan must not flag healthy host as recovery#36
Merged
fix(plan): stateless re-plan must not flag healthy host as recovery#36
Conversation
`src/cli/plan.rs:131-135` and `src/cli/apply.rs:909-913` chose between
`ApplyRunDisplayState::FirstRun` and `::Recovery` based on a binary
heuristic on the observed snapshot:
let run_display_state = if observed_snapshot.objects.is_empty() {
ApplyRunDisplayState::FirstRun
} else {
ApplyRunDisplayState::Recovery
};
In stateless mode (`--source-repo`), there is no `/var/lib/core-ops/`
baseline by design (FR-013 / SC-009), so the controller cannot tell
"host has units because a prior apply succeeded" from "host has units
because a prior apply failed mid-flight." The heuristic therefore
emitted `(recovery from failed initial apply)` in the rendered header
on **every** stateless invocation against a host that has converged
state — including immediately after a perfectly clean
`Outcome: converged` apply. End-to-end this looked like:
$ core-ops apply --source-repo examples/03-immich --host example
Apply for host example @ (stateless) (first run)
... 10 created ...
Outcome: converged
$ core-ops plan --source-repo examples/03-immich --host example
Plan for host example @ (stateless) (recovery from failed initial apply)
[·] Unchanged • 10
Summary: 10 unchanged
The "recovery from failed initial apply" tagline directly contradicts
the immediately preceding `Outcome: converged` and "10 unchanged"
signals, undercutting the demonstrated trustworthiness of the
canonical `examples/03-immich` walkthrough that spec/018 ships in the
README and `docs/onboarding.cast`.
The comment immediately above the buggy heuristic stated the correct
intent — "Treat as FirstRun for header-rendering purposes — the
source path is always the primary identifier in the rendered header"
— but the implementation diverged. This change makes the
implementation match the comment's intent: in stateless mode the
path-based provenance prefix (`(stateless)`) carries the relevant
signal alone; the first-run / recovery suffix does not apply once
the host has observed objects.
Resolution:
- Add `ApplyRunDisplayState::Stateless` (`src/cli/report.rs:33`).
Renders no suffix beyond the version identifier; the caller's
`(stateless)` provenance prefix is preserved.
- Update both stateless call sites (`src/cli/plan.rs`,
`src/cli/apply.rs`) to select `Stateless` when the host has
observed objects, falling back to `FirstRun` when the host is
empty (the latter case keeps "(first run)" as a useful initial-
apply signal).
- The init'd-mode helpers
(`classify_plan_run_display_state`, `classify_apply_run_display_state`)
are unchanged: their heuristic is correct because they DO consult
`last_applied_revision`, which can disambiguate first-run /
recovery / managed states.
- Added unit test
`tests/integration/test_apply_report.rs::stateless_run_display_state_omits_first_run_and_recovery_suffixes`
locking in the new behavior.
Verification on `core-ops-uat` (Fedora CoreOS guest):
Empty host:
$ core-ops plan --source-repo examples/03-immich --host example
Plan for host example @ (stateless) (first run)
$ sudo core-ops apply --source-repo examples/03-immich --host example
Apply for host example @ (stateless) (first run)
Converged host:
$ core-ops plan --source-repo examples/03-immich --host example
Plan for host example @ (stateless)
[·] Unchanged • 10
Patch bump (modify under `src/`, `rust_source_surface` rule).
Discovered while exercising the canonical Immich walkthrough for
spec/018 session 3; the misleading "(recovery from failed initial
apply)" landed in the spec/018 asciicast recording, directly
undercutting the slice's "increase adoption likelihood" goal.
$ cargo test 473 passed
$ cargo clippy --all-targets -- -D warnings clean
$ cargo run --bin core-ops-release -- validate --base-ref master passed (patch)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod
added a commit
that referenced
this pull request
May 7, 2026
Rebases onto post-promote master at v2.2.3 (which now contains both spec/018-blocking fixes: PR #35 wired the immich-db-password Podman secret into immich-server.container, and PR #36 introduced ApplyRunDisplayState::Stateless so stateless re-plans no longer flag a healthy host as "(recovery from failed initial apply)"). Re-recording on `core-ops-uat` (Fedora CoreOS guest) against the fixed binary + fixed example produces the truthful narrative the spec/018 walkthrough is supposed to demonstrate: Beat 1 (plan): Plan for host example @ (stateless) (first run) Beat 2 (apply): Apply for host example @ (stateless) (first run) Outcome: converged Beat 3 (replan): Plan for host example @ (stateless) 10 unchanged immich-server reaches `active running` (NRestarts=1) and journal shows "Immich Microservices is running [v2.7.5] [production]" — no auth restart loop. The misleading "(recovery from failed initial apply)" suffix is gone. Cast post-processed to strip OSC 3008 sequences (pam_systemd hostname/machineid leak under sudo) and replace nix-store-path SHELL env value, per decision_018-recording-ssh-delegation. Cast duration 4.80 s, ≤ 90 s SC-005a budget. T013 README walkthrough block updated to keep verbatim fidelity to the new T012 capture: the second fenced block's header and underline now match the corrected output (no "(recovery)" suffix; underline length adjusted from 72 chars to 35 chars to match the shorter title). Block count = 2 (FR-006), combined non-blank lines = 25 (SC-007b), unique unit identifiers = 5 (SC-007). The rebase dropped the prior `chore(release): bump 2.2.1 -> 2.2.2` commit (subsumed by master's PR #34/#35/#36 promotes); spec/018 now bumps `2.2.3 -> 2.2.4` per `packaged_readme_surface` carve-out. Verification: $ cargo test 473 passed $ cargo clippy --all-targets -- -D warnings clean $ cargo run --bin core-ops-release -- validate --base-ref master passed (patch) $ wc -l README.md 297 (≤ 400) PASS SC-001 $ <walkthrough section>: 2 fenced blocks, 25 non-blank lines, 5 unit IDs PASS FR-006/SC-007/SC-007b $ head -n 1 docs/onboarding.cast | jq '.version' 2 PASS SC-005 $ head -n 1 docs/onboarding.cast | jq '.duration' 4.80 (≤ 90) PASS SC-005a $ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh (no matches) PASS SC-006a $ grep -c '3008' docs/onboarding.cast 0 PASS FR-009a $ asciinema play docs/onboarding.cast plays end-to-end PASS Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod
added a commit
that referenced
this pull request
May 7, 2026
… playback Operator reported staircase distortion in the rendered cast/GIF: each successive line started indented by the length of the previous line, the classic "LF without CR" pattern. Root cause: the inner recording script piped SSH output through `| sed 's/\r$//'`, which stripped every `\r` that the remote PTY's ONLCR translation had added. The captured cast events therefore contained bare `\n` (line feed only). asciinema 2.4.0 plays casts through a raw-mode local TTY (so it can faithfully replay any escape sequences without double-translation), which means LF stays LF on output — the terminal treats each LF as "down one row" with no column reset, hence the staircase. The original sed filter was probably defensive — line-buffered tools sometimes choke on trailing `\r` — but for asciinema-capture the CRs are essential. Drop the filter and re-record. After re-recording the plan-output event has 124 CRs and 124 LFs (matched), confirming `\r\n` line terminators throughout. `asciinema cat docs/onboarding.cast` now renders left-aligned. The GIF rendered from this cast (`docs/assets/core-ops-demo.gif`, 107 KB GIF89a) displays the correct terminal layout inline on GitHub. Cast re-recorded against the same fixed stack (PRs #34/#35/#36 merged): Beat 1 (plan): Plan for host example @ (stateless) (first run) Beat 2 (apply): Apply for host example @ (stateless) (first run) Outcome: converged Beat 3 (replan): Plan for host example @ (stateless) 10 unchanged Same OSC 3008 strip + SHELL sanitize post-processing applied (per decision_018-recording-ssh-delegation). Duration 4.88 s, well under SC-005a's 90 s budget. Verification: $ head -n 1 docs/onboarding.cast | jq '.version' 2 $ head -n 1 docs/onboarding.cast | jq '.duration' 4.88012 (≤ 90) $ grep -c '3008' docs/onboarding.cast 0 $ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh (no matches) $ asciinema cat docs/onboarding.cast | head -3 left-aligned, no staircase $ wc -c < docs/assets/core-ops-demo.gif 106777 (≤ 1 MB) The non-tracked recording driver `/tmp/onboarding-inner.sh` carried the buggy sed pipe in this session; the canonical `docs/onboarding-script.sh` does NOT have a sed filter (its inner `demo()` runs commands directly via `eval`, not through SSH delegation). The bug therefore lives only in the SSH-delegation recording procedure documented in decision_018-recording-ssh-delegation, not in the canonical regeneration script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod
added a commit
that referenced
this pull request
May 7, 2026
Rebases onto post-promote master at v2.2.3 (which now contains both spec/018-blocking fixes: PR #35 wired the immich-db-password Podman secret into immich-server.container, and PR #36 introduced ApplyRunDisplayState::Stateless so stateless re-plans no longer flag a healthy host as "(recovery from failed initial apply)"). Re-recording on `core-ops-uat` (Fedora CoreOS guest) against the fixed binary + fixed example produces the truthful narrative the spec/018 walkthrough is supposed to demonstrate: Beat 1 (plan): Plan for host example @ (stateless) (first run) Beat 2 (apply): Apply for host example @ (stateless) (first run) Outcome: converged Beat 3 (replan): Plan for host example @ (stateless) 10 unchanged immich-server reaches `active running` (NRestarts=1) and journal shows "Immich Microservices is running [v2.7.5] [production]" — no auth restart loop. The misleading "(recovery from failed initial apply)" suffix is gone. Cast post-processed to strip OSC 3008 sequences (pam_systemd hostname/machineid leak under sudo) and replace nix-store-path SHELL env value, per decision_018-recording-ssh-delegation. Cast duration 4.80 s, ≤ 90 s SC-005a budget. T013 README walkthrough block updated to keep verbatim fidelity to the new T012 capture: the second fenced block's header and underline now match the corrected output (no "(recovery)" suffix; underline length adjusted from 72 chars to 35 chars to match the shorter title). Block count = 2 (FR-006), combined non-blank lines = 25 (SC-007b), unique unit identifiers = 5 (SC-007). The rebase dropped the prior `chore(release): bump 2.2.1 -> 2.2.2` commit (subsumed by master's PR #34/#35/#36 promotes); spec/018 now bumps `2.2.3 -> 2.2.4` per `packaged_readme_surface` carve-out. Verification: $ cargo test 473 passed $ cargo clippy --all-targets -- -D warnings clean $ cargo run --bin core-ops-release -- validate --base-ref master passed (patch) $ wc -l README.md 297 (≤ 400) PASS SC-001 $ <walkthrough section>: 2 fenced blocks, 25 non-blank lines, 5 unit IDs PASS FR-006/SC-007/SC-007b $ head -n 1 docs/onboarding.cast | jq '.version' 2 PASS SC-005 $ head -n 1 docs/onboarding.cast | jq '.duration' 4.80 (≤ 90) PASS SC-005a $ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh (no matches) PASS SC-006a $ grep -c '3008' docs/onboarding.cast 0 PASS FR-009a $ asciinema play docs/onboarding.cast plays end-to-end PASS Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod
added a commit
that referenced
this pull request
May 7, 2026
… playback Operator reported staircase distortion in the rendered cast/GIF: each successive line started indented by the length of the previous line, the classic "LF without CR" pattern. Root cause: the inner recording script piped SSH output through `| sed 's/\r$//'`, which stripped every `\r` that the remote PTY's ONLCR translation had added. The captured cast events therefore contained bare `\n` (line feed only). asciinema 2.4.0 plays casts through a raw-mode local TTY (so it can faithfully replay any escape sequences without double-translation), which means LF stays LF on output — the terminal treats each LF as "down one row" with no column reset, hence the staircase. The original sed filter was probably defensive — line-buffered tools sometimes choke on trailing `\r` — but for asciinema-capture the CRs are essential. Drop the filter and re-record. After re-recording the plan-output event has 124 CRs and 124 LFs (matched), confirming `\r\n` line terminators throughout. `asciinema cat docs/onboarding.cast` now renders left-aligned. The GIF rendered from this cast (`docs/assets/core-ops-demo.gif`, 107 KB GIF89a) displays the correct terminal layout inline on GitHub. Cast re-recorded against the same fixed stack (PRs #34/#35/#36 merged): Beat 1 (plan): Plan for host example @ (stateless) (first run) Beat 2 (apply): Apply for host example @ (stateless) (first run) Outcome: converged Beat 3 (replan): Plan for host example @ (stateless) 10 unchanged Same OSC 3008 strip + SHELL sanitize post-processing applied (per decision_018-recording-ssh-delegation). Duration 4.88 s, well under SC-005a's 90 s budget. Verification: $ head -n 1 docs/onboarding.cast | jq '.version' 2 $ head -n 1 docs/onboarding.cast | jq '.duration' 4.88012 (≤ 90) $ grep -c '3008' docs/onboarding.cast 0 $ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh (no matches) $ asciinema cat docs/onboarding.cast | head -3 left-aligned, no staircase $ wc -c < docs/assets/core-ops-demo.gif 106777 (≤ 1 MB) The non-tracked recording driver `/tmp/onboarding-inner.sh` carried the buggy sed pipe in this session; the canonical `docs/onboarding-script.sh` does NOT have a sed filter (its inner `demo()` runs commands directly via `eval`, not through SSH delegation). The bug therefore lives only in the SSH-delegation recording procedure documented in decision_018-recording-ssh-delegation, not in the canonical regeneration script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
src/cli/plan.rs:131-135andsrc/cli/apply.rs:909-913selectedApplyRunDisplayState::Recovery(rendered as(recovery from failed initial apply)) for every stateless invocation against a host that already has observed objects — including immediately after a clean `Outcome: converged` apply.ApplyRunDisplayState::Statelessthat renders no suffix beyond the version identifier. Stateless plan/apply now selectStatelesswhen the host has observed objects, falling back toFirstRunwhen the host is empty (the latter case keeps "(first run)" as a useful initial-apply signal).last_applied_revision.rust_source_surfaceforsrc/modify); fixture pin updated in lock-step.Discovered while exercising the canonical Immich walkthrough for spec/018 (the next slice's recording will replay against this fix). The misleading suffix had been landing in
docs/onboarding.castand directly undercut the slice's "increase adoption likelihood" goal.Test plan
`tests/integration/test_apply_report.rs::stateless_run_display_state_omits_first_run_and_recovery_suffixes`).
`Plan for host example @ (stateless) (first run)` (preserved).
`Plan for host example @ (stateless)` (no misleading suffix).
`Apply for host example @ (stateless) (first run)`.
🤖 Generated with Claude Code