Skip to content

fix(plan): stateless re-plan must not flag healthy host as recovery#36

Merged
outergod merged 1 commit intomasterfrom
fix/stateless-plan-annotation
May 7, 2026
Merged

fix(plan): stateless re-plan must not flag healthy host as recovery#36
outergod merged 1 commit intomasterfrom
fix/stateless-plan-annotation

Conversation

@outergod
Copy link
Copy Markdown
Owner

@outergod outergod commented May 7, 2026

Summary

  • src/cli/plan.rs:131-135 and src/cli/apply.rs:909-913 selected ApplyRunDisplayState::Recovery (rendered as (recovery from failed initial apply)) for every stateless invocation against a host that already has observed objects — including immediately after a clean `Outcome: converged` apply.
  • The comment above the buggy heuristic stated the correct intent ("Treat as FirstRun for header-rendering purposes — the source path is always the primary identifier in the rendered header") but the implementation diverged. In stateless mode the controller cannot distinguish "host has units because a prior apply succeeded" from "host has units because a prior apply failed mid-flight," so the recovery suffix is structurally misleading.
  • Add ApplyRunDisplayState::Stateless that renders no suffix beyond the version identifier. Stateless plan/apply now select Stateless when the host has observed objects, falling back to FirstRun when the host is empty (the latter case keeps "(first run)" as a useful initial-apply signal).
  • Init'd-mode helpers are unchanged — their heuristic is correct because they consult last_applied_revision.
  • Patch bump (rust_source_surface for src/ modify); fixture pin updated in lock-step.

Discovered while exercising the canonical Immich walkthrough for spec/018 (the next slice's recording will replay against this fix). The misleading suffix had been landing in docs/onboarding.cast and directly undercut the slice's "increase adoption likelihood" goal.

Test plan

  • `cargo test` — 473 passed (one new unit test:
    `tests/integration/test_apply_report.rs::stateless_run_display_state_omits_first_run_and_recovery_suffixes`).
  • `cargo clippy --all-targets -- -D warnings` — clean.
  • `cargo run --bin core-ops-release -- validate --base-ref master` — passed (patch).
  • End-to-end on `core-ops-uat` (Fedora CoreOS guest), empty host:
    `Plan for host example @ (stateless) (first run)` (preserved).
  • End-to-end on `core-ops-uat`, converged host:
    `Plan for host example @ (stateless)` (no misleading suffix).
  • End-to-end on `core-ops-uat`, apply on empty host:
    `Apply for host example @ (stateless) (first run)`.
  • CI green.
  • Post-merge: spec/018 picks this up via rebase + re-records `docs/onboarding.cast` against the corrected output.

🤖 Generated with Claude Code

`src/cli/plan.rs:131-135` and `src/cli/apply.rs:909-913` chose between
`ApplyRunDisplayState::FirstRun` and `::Recovery` based on a binary
heuristic on the observed snapshot:

    let run_display_state = if observed_snapshot.objects.is_empty() {
        ApplyRunDisplayState::FirstRun
    } else {
        ApplyRunDisplayState::Recovery
    };

In stateless mode (`--source-repo`), there is no `/var/lib/core-ops/`
baseline by design (FR-013 / SC-009), so the controller cannot tell
"host has units because a prior apply succeeded" from "host has units
because a prior apply failed mid-flight." The heuristic therefore
emitted `(recovery from failed initial apply)` in the rendered header
on **every** stateless invocation against a host that has converged
state — including immediately after a perfectly clean
`Outcome: converged` apply. End-to-end this looked like:

    $ core-ops apply --source-repo examples/03-immich --host example
      Apply for host example @ (stateless) (first run)
      ... 10 created ...
      Outcome: converged
    $ core-ops plan --source-repo examples/03-immich --host example
      Plan for host example @ (stateless) (recovery from failed initial apply)
      [·] Unchanged • 10
      Summary: 10 unchanged

The "recovery from failed initial apply" tagline directly contradicts
the immediately preceding `Outcome: converged` and "10 unchanged"
signals, undercutting the demonstrated trustworthiness of the
canonical `examples/03-immich` walkthrough that spec/018 ships in the
README and `docs/onboarding.cast`.

The comment immediately above the buggy heuristic stated the correct
intent — "Treat as FirstRun for header-rendering purposes — the
source path is always the primary identifier in the rendered header"
— but the implementation diverged. This change makes the
implementation match the comment's intent: in stateless mode the
path-based provenance prefix (`(stateless)`) carries the relevant
signal alone; the first-run / recovery suffix does not apply once
the host has observed objects.

Resolution:

- Add `ApplyRunDisplayState::Stateless` (`src/cli/report.rs:33`).
  Renders no suffix beyond the version identifier; the caller's
  `(stateless)` provenance prefix is preserved.
- Update both stateless call sites (`src/cli/plan.rs`,
  `src/cli/apply.rs`) to select `Stateless` when the host has
  observed objects, falling back to `FirstRun` when the host is
  empty (the latter case keeps "(first run)" as a useful initial-
  apply signal).
- The init'd-mode helpers
  (`classify_plan_run_display_state`, `classify_apply_run_display_state`)
  are unchanged: their heuristic is correct because they DO consult
  `last_applied_revision`, which can disambiguate first-run /
  recovery / managed states.
- Added unit test
  `tests/integration/test_apply_report.rs::stateless_run_display_state_omits_first_run_and_recovery_suffixes`
  locking in the new behavior.

Verification on `core-ops-uat` (Fedora CoreOS guest):

    Empty host:
    $ core-ops plan --source-repo examples/03-immich --host example
      Plan for host example @ (stateless) (first run)
    $ sudo core-ops apply --source-repo examples/03-immich --host example
      Apply for host example @ (stateless) (first run)

    Converged host:
    $ core-ops plan --source-repo examples/03-immich --host example
      Plan for host example @ (stateless)
      [·] Unchanged • 10

Patch bump (modify under `src/`, `rust_source_surface` rule).

Discovered while exercising the canonical Immich walkthrough for
spec/018 session 3; the misleading "(recovery from failed initial
apply)" landed in the spec/018 asciicast recording, directly
undercutting the slice's "increase adoption likelihood" goal.

$ cargo test                                                                  473 passed
$ cargo clippy --all-targets -- -D warnings                                   clean
$ cargo run --bin core-ops-release -- validate --base-ref master              passed (patch)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@outergod outergod merged commit 9198f54 into master May 7, 2026
5 of 6 checks passed
@outergod outergod deleted the fix/stateless-plan-annotation branch May 7, 2026 07:58
outergod added a commit that referenced this pull request May 7, 2026


Rebases onto post-promote master at v2.2.3 (which now contains both
spec/018-blocking fixes: PR #35 wired the immich-db-password Podman
secret into immich-server.container, and PR #36 introduced
ApplyRunDisplayState::Stateless so stateless re-plans no longer
flag a healthy host as "(recovery from failed initial apply)").

Re-recording on `core-ops-uat` (Fedora CoreOS guest) against the
fixed binary + fixed example produces the truthful narrative the
spec/018 walkthrough is supposed to demonstrate:

  Beat 1 (plan):   Plan for host example @ (stateless) (first run)
  Beat 2 (apply):  Apply for host example @ (stateless) (first run)
                   Outcome: converged
  Beat 3 (replan): Plan for host example @ (stateless)
                   10 unchanged

immich-server reaches `active running` (NRestarts=1) and journal
shows "Immich Microservices is running [v2.7.5] [production]" —
no auth restart loop. The misleading
"(recovery from failed initial apply)" suffix is gone.

Cast post-processed to strip OSC 3008 sequences (pam_systemd
hostname/machineid leak under sudo) and replace nix-store-path
SHELL env value, per decision_018-recording-ssh-delegation. Cast
duration 4.80 s, ≤ 90 s SC-005a budget.

T013 README walkthrough block updated to keep verbatim fidelity to
the new T012 capture: the second fenced block's header and underline
now match the corrected output (no "(recovery)" suffix; underline
length adjusted from 72 chars to 35 chars to match the shorter title).
Block count = 2 (FR-006), combined non-blank lines = 25 (SC-007b),
unique unit identifiers = 5 (SC-007).

The rebase dropped the prior `chore(release): bump 2.2.1 -> 2.2.2`
commit (subsumed by master's PR #34/#35/#36 promotes); spec/018 now
bumps `2.2.3 -> 2.2.4` per `packaged_readme_surface` carve-out.

Verification:

$ cargo test                                                                  473 passed
$ cargo clippy --all-targets -- -D warnings                                   clean
$ cargo run --bin core-ops-release -- validate --base-ref master              passed (patch)
$ wc -l README.md                                                             297       (≤ 400)            PASS  SC-001
$ <walkthrough section>: 2 fenced blocks, 25 non-blank lines, 5 unit IDs       PASS  FR-006/SC-007/SC-007b
$ head -n 1 docs/onboarding.cast | jq '.version'                              2                            PASS  SC-005
$ head -n 1 docs/onboarding.cast | jq '.duration'                             4.80      (≤ 90)             PASS  SC-005a
$ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh
                                                                              (no matches)                 PASS  SC-006a
$ grep -c '3008' docs/onboarding.cast                                         0                            PASS  FR-009a
$ asciinema play docs/onboarding.cast                                         plays end-to-end             PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod added a commit that referenced this pull request May 7, 2026
… playback

Operator reported staircase distortion in the rendered cast/GIF: each
successive line started indented by the length of the previous line,
the classic "LF without CR" pattern.

Root cause: the inner recording script piped SSH output through
`| sed 's/\r$//'`, which stripped every `\r` that the remote PTY's
ONLCR translation had added. The captured cast events therefore
contained bare `\n` (line feed only). asciinema 2.4.0 plays casts
through a raw-mode local TTY (so it can faithfully replay any
escape sequences without double-translation), which means LF stays
LF on output — the terminal treats each LF as "down one row" with
no column reset, hence the staircase.

The original sed filter was probably defensive — line-buffered tools
sometimes choke on trailing `\r` — but for asciinema-capture the CRs
are essential. Drop the filter and re-record.

After re-recording the plan-output event has 124 CRs and 124 LFs
(matched), confirming `\r\n` line terminators throughout. `asciinema
cat docs/onboarding.cast` now renders left-aligned. The GIF rendered
from this cast (`docs/assets/core-ops-demo.gif`, 107 KB GIF89a)
displays the correct terminal layout inline on GitHub.

Cast re-recorded against the same fixed stack (PRs #34/#35/#36
merged):

  Beat 1 (plan):   Plan for host example @ (stateless) (first run)
  Beat 2 (apply):  Apply for host example @ (stateless) (first run)
                   Outcome: converged
  Beat 3 (replan): Plan for host example @ (stateless)
                   10 unchanged

Same OSC 3008 strip + SHELL sanitize post-processing applied (per
decision_018-recording-ssh-delegation). Duration 4.88 s, well under
SC-005a's 90 s budget.

Verification:

$ head -n 1 docs/onboarding.cast | jq '.version'                  2
$ head -n 1 docs/onboarding.cast | jq '.duration'                 4.88012  (≤ 90)
$ grep -c '3008' docs/onboarding.cast                             0
$ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh
                                                                  (no matches)
$ asciinema cat docs/onboarding.cast | head -3                    left-aligned, no staircase
$ wc -c < docs/assets/core-ops-demo.gif                           106777  (≤ 1 MB)

The non-tracked recording driver `/tmp/onboarding-inner.sh` carried
the buggy sed pipe in this session; the canonical
`docs/onboarding-script.sh` does NOT have a sed filter (its inner
`demo()` runs commands directly via `eval`, not through SSH
delegation). The bug therefore lives only in the SSH-delegation
recording procedure documented in
decision_018-recording-ssh-delegation, not in the canonical
regeneration script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod added a commit that referenced this pull request May 7, 2026


Rebases onto post-promote master at v2.2.3 (which now contains both
spec/018-blocking fixes: PR #35 wired the immich-db-password Podman
secret into immich-server.container, and PR #36 introduced
ApplyRunDisplayState::Stateless so stateless re-plans no longer
flag a healthy host as "(recovery from failed initial apply)").

Re-recording on `core-ops-uat` (Fedora CoreOS guest) against the
fixed binary + fixed example produces the truthful narrative the
spec/018 walkthrough is supposed to demonstrate:

  Beat 1 (plan):   Plan for host example @ (stateless) (first run)
  Beat 2 (apply):  Apply for host example @ (stateless) (first run)
                   Outcome: converged
  Beat 3 (replan): Plan for host example @ (stateless)
                   10 unchanged

immich-server reaches `active running` (NRestarts=1) and journal
shows "Immich Microservices is running [v2.7.5] [production]" —
no auth restart loop. The misleading
"(recovery from failed initial apply)" suffix is gone.

Cast post-processed to strip OSC 3008 sequences (pam_systemd
hostname/machineid leak under sudo) and replace nix-store-path
SHELL env value, per decision_018-recording-ssh-delegation. Cast
duration 4.80 s, ≤ 90 s SC-005a budget.

T013 README walkthrough block updated to keep verbatim fidelity to
the new T012 capture: the second fenced block's header and underline
now match the corrected output (no "(recovery)" suffix; underline
length adjusted from 72 chars to 35 chars to match the shorter title).
Block count = 2 (FR-006), combined non-blank lines = 25 (SC-007b),
unique unit identifiers = 5 (SC-007).

The rebase dropped the prior `chore(release): bump 2.2.1 -> 2.2.2`
commit (subsumed by master's PR #34/#35/#36 promotes); spec/018 now
bumps `2.2.3 -> 2.2.4` per `packaged_readme_surface` carve-out.

Verification:

$ cargo test                                                                  473 passed
$ cargo clippy --all-targets -- -D warnings                                   clean
$ cargo run --bin core-ops-release -- validate --base-ref master              passed (patch)
$ wc -l README.md                                                             297       (≤ 400)            PASS  SC-001
$ <walkthrough section>: 2 fenced blocks, 25 non-blank lines, 5 unit IDs       PASS  FR-006/SC-007/SC-007b
$ head -n 1 docs/onboarding.cast | jq '.version'                              2                            PASS  SC-005
$ head -n 1 docs/onboarding.cast | jq '.duration'                             4.80      (≤ 90)             PASS  SC-005a
$ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh
                                                                              (no matches)                 PASS  SC-006a
$ grep -c '3008' docs/onboarding.cast                                         0                            PASS  FR-009a
$ asciinema play docs/onboarding.cast                                         plays end-to-end             PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outergod added a commit that referenced this pull request May 7, 2026
… playback

Operator reported staircase distortion in the rendered cast/GIF: each
successive line started indented by the length of the previous line,
the classic "LF without CR" pattern.

Root cause: the inner recording script piped SSH output through
`| sed 's/\r$//'`, which stripped every `\r` that the remote PTY's
ONLCR translation had added. The captured cast events therefore
contained bare `\n` (line feed only). asciinema 2.4.0 plays casts
through a raw-mode local TTY (so it can faithfully replay any
escape sequences without double-translation), which means LF stays
LF on output — the terminal treats each LF as "down one row" with
no column reset, hence the staircase.

The original sed filter was probably defensive — line-buffered tools
sometimes choke on trailing `\r` — but for asciinema-capture the CRs
are essential. Drop the filter and re-record.

After re-recording the plan-output event has 124 CRs and 124 LFs
(matched), confirming `\r\n` line terminators throughout. `asciinema
cat docs/onboarding.cast` now renders left-aligned. The GIF rendered
from this cast (`docs/assets/core-ops-demo.gif`, 107 KB GIF89a)
displays the correct terminal layout inline on GitHub.

Cast re-recorded against the same fixed stack (PRs #34/#35/#36
merged):

  Beat 1 (plan):   Plan for host example @ (stateless) (first run)
  Beat 2 (apply):  Apply for host example @ (stateless) (first run)
                   Outcome: converged
  Beat 3 (replan): Plan for host example @ (stateless)
                   10 unchanged

Same OSC 3008 strip + SHELL sanitize post-processing applied (per
decision_018-recording-ssh-delegation). Duration 4.88 s, well under
SC-005a's 90 s budget.

Verification:

$ head -n 1 docs/onboarding.cast | jq '.version'                  2
$ head -n 1 docs/onboarding.cast | jq '.duration'                 4.88012  (≤ 90)
$ grep -c '3008' docs/onboarding.cast                             0
$ grep -iE '(not\.one|ulthar|192\.168\.|10\.0\.|172\.16\.)' docs/onboarding.cast docs/onboarding-script.sh
                                                                  (no matches)
$ asciinema cat docs/onboarding.cast | head -3                    left-aligned, no staircase
$ wc -c < docs/assets/core-ops-demo.gif                           106777  (≤ 1 MB)

The non-tracked recording driver `/tmp/onboarding-inner.sh` carried
the buggy sed pipe in this session; the canonical
`docs/onboarding-script.sh` does NOT have a sed filter (its inner
`demo()` runs commands directly via `eval`, not through SSH
delegation). The bug therefore lives only in the SSH-delegation
recording procedure documented in
decision_018-recording-ssh-delegation, not in the canonical
regeneration script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant