Skip to content

fix(onboard): repair rebuilt resume snapshots#4938

Merged
cv merged 2 commits into
mainfrom
codex/fix-rebuild-resume-fsm
Jun 8, 2026
Merged

fix(onboard): repair rebuilt resume snapshots#4938
cv merged 2 commits into
mainfrom
codex/fix-rebuild-resume-fsm

Conversation

@cv

@cv cv commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Summary

Repairs reopened terminal onboarding machine snapshots before the record-only resume path replays compatibility phases. It also makes deployment messaging verification check sandbox-scoped bridge provider names and skip tokenless WhatsApp provider checks, removing the misleading missing-provider warning from rebuild logs.

Related Issue

Refs #4533
Context: #4472

Changes

  • Repair complete machine snapshots that have been reopened for resume/rebuild using legacy step state, while leaving non-resumable completed sessions untouched.
  • Add record-only resume regression coverage for reopened complete snapshots.
  • Verify messaging providers with sandbox-scoped bridge provider names and skip provider checks for tokenless WhatsApp.
  • Add deployment verification tests for channel-to-provider mapping and tokenless channels.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved session recovery to repair certain completed workflow snapshots while leaving non-resumable completed snapshots unchanged
    • More accurate messaging provider verification that maps messaging channels to expected gateway providers and skips tokenless channels (e.g., WhatsApp)
  • Tests

    • Added coverage for session repair edge cases, including resumable vs non-resumable complete snapshots and record-only resume scenarios
    • Expanded messaging provider verification tests, including tokenless channel handling

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc3becff-c96a-4497-a903-03d3788eca4f

📥 Commits

Reviewing files that changed from the base of the PR and between 32fe631 and 9e7b8e4.

📒 Files selected for processing (2)
  • src/lib/verify-deployment-messaging.test.ts
  • src/lib/verify-deployment.test.ts
💤 Files with no reviewable changes (1)
  • src/lib/verify-deployment.test.ts

📝 Walkthrough

Walkthrough

This PR: (1) extends terminal-machine snapshot repair to also handle certain resumable "complete" sessions via a new predicate and tests; (2) changes messaging verification to derive expected gateway provider names per channel and skip checks for tokenless channels like whatsapp, with new tests.

Changes

Resume Machine Repair for Complete Sessions

Layer / File(s) Summary
Complete state repair and resumable session handling
src/lib/onboard/resume-machine-repair.ts, src/lib/onboard/resume-machine-repair.test.ts
Introduce shouldRepairTerminalMachineSnapshot (repair when durable machine is failed or complete but session remains resumable). repairResumeMachineSnapshot uses this predicate to decide early return and may advance the machine from lastCompletedStep. Tests added for resumable vs non-resumable complete snapshots and record-only resume scenarios.

Messaging Provider Verification with Derived Provider Names

Layer / File(s) Summary
Provider name derivation and tokenless channel allowlist
src/lib/verify-deployment.ts
Import getMessagingProviderNamesForChannel, add TOKENLESS_MESSAGING_CHANNELS (includes "whatsapp"), and update verification to derive expected gateway provider names per messaging channel and verify each; skip check when channel is tokenless and yields no provider names.
Messaging provider derivation and tokenless channel test coverage
src/lib/verify-deployment-messaging.test.ts, src/lib/verify-deployment.test.ts
Add new tests and adjust mocks to assert sandbox-scoped provider checks for configured channels, warn behavior when Slack providers are missing, and that tokenless channels (WhatsApp) skip gateway checks and emit no messaging diagnostics.

Sequence Diagram(s)

No additional sequence diagrams generated beyond the hidden artifact.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4472: Both PRs modify repairResumeMachineSnapshot behavior for onboarding session snapshot repairs.

Suggested labels

fix, onboarding, area: messaging

Suggested reviewers

  • ericksoa
  • jyaunches
  • prekshivyas
  • sandl99

Poem

🐰 I hop through snapshots, patch and mend,
Resumable completes I gently send,
I map the bridges, skip whatsapp's gate,
Tests hum softly, the checks look great. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main objective of the PR: fixing resume snapshot repair for rebuilt (reopened) terminal states in the onboarding machine, which is the primary change across the modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/fix-rebuild-resume-fsm

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: onboard-resume-e2e, onboard-repair-e2e, messaging-providers-e2e
Optional E2E: cloud-onboard-e2e, device-auth-health-e2e

Dispatch hint: onboard-resume-e2e,onboard-repair-e2e,messaging-providers-e2e

Auto-dispatched E2E: onboard-resume-e2e, onboard-repair-e2e, messaging-providers-e2e via nightly-e2e.yaml at 9e7b8e44e24baabcd41397ada68ff4693db013adnightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • onboard-resume-e2e (medium; up to 60 minutes): Directly exercises interrupted onboard state, durable session data, nemoclaw onboard --resume --non-interactive, and completion after resume; required because the PR changes resume-machine terminal snapshot repair semantics.
  • onboard-repair-e2e (medium; up to 60 minutes): Covers resume repair behavior for stale recorded sandbox state and resume invalidation paths; required because the modified repair bridge determines where a terminal session is resumed from.
  • messaging-providers-e2e (medium/high; about 15 minutes nominal, workflow timeout 75 minutes): Validates real Telegram/Discord/Slack provider creation, sandbox-scoped provider names, credential isolation, OpenClaw config/runtime channel discovery, and WhatsApp no-provider behavior; required because verifyDeployment messaging provider checks now depend on those provider-name and tokenless-channel rules.

Optional E2E

  • cloud-onboard-e2e (medium): Useful broad confidence that normal OpenClaw onboarding still reaches post-deployment verification and declares the sandbox live after changes to verify-deployment.ts, but not strictly merge-blocking beyond the targeted messaging verification E2E.
  • device-auth-health-e2e (medium; workflow timeout 30 minutes): Optional regression coverage for verifyDeployment health messaging and gateway/dashboard probing in a real onboarded sandbox, including the device-auth 401-alive path adjacent to the touched verification code.

New E2E recommendations

  • onboarding-resume-state-machine (high): Existing resume E2Es cover failed/interrupted sessions, but this PR specifically adds repair for an in-progress, resumable session whose durable machine snapshot is already complete after rebuild. Add a scenario that creates or edits a real session into that reopened-complete state and verifies onboard --resume restarts from the next step and completes.
    • Suggested test: Add a targeted reopened-complete-snapshot resume E2E, e.g. test/e2e/test-onboard-resume-reopened-complete.sh, and expose it as a nightly selective job.
  • messaging-provider-verification (medium): messaging-providers-e2e proves providers exist and WhatsApp has no provider, but it does not appear to force a missing sandbox-scoped Slack bridge/app provider and assert verifyDeployment emits a warning while keeping the deployment non-blocking.
    • Suggested test: Add a messaging verification negative-path E2E that configures Slack, removes or hides one expected provider, runs deployment diagnostics/verification, and asserts a messaging warning rather than an unhealthy deployment.

Dispatch hint

  • Workflow: E2E / Nightly
  • jobs input: onboard-resume-e2e,onboard-repair-e2e,messaging-providers-e2e

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw-resume, ubuntu-repo-cloud-openclaw-slack, ubuntu-repo-cloud-openclaw-discord
Optional scenario E2E: ubuntu-repo-cloud-openclaw-telegram, ubuntu-repo-cloud-hermes-slack

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw-resume: Changes resume terminal snapshot repair used by --resume onboarding; this scenario exercises the cloud OpenClaw resume-after-interrupt onboarding lifecycle and validates the resumed sandbox reaches smoke-ready state.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
  • ubuntu-repo-cloud-openclaw-slack: verify-deployment messaging provider checks now resolve sandbox-scoped provider names, with Slack requiring both bridge and app providers; this is the primary scenario for the changed multi-provider messaging path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
  • ubuntu-repo-cloud-openclaw-discord: verify-deployment changed generic messaging provider existence checks from channel names to sandbox-scoped provider names; Discord covers the single-provider non-Slack branch on OpenClaw.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord

Optional scenario E2E

  • ubuntu-repo-cloud-openclaw-telegram: Adjacent OpenClaw messaging scenario exercising the same provider-resolution path for Telegram; useful if extra confidence is desired, but Slack and Discord cover the main changed branches.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
  • ubuntu-repo-cloud-hermes-slack: Adjacent Slack messaging coverage on the Hermes onboarding profile; optional because the primary changed deployment verification and provider-name behavior is covered by the OpenClaw Slack scenario.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-slack

Relevant changed files

  • src/lib/onboard/resume-machine-repair.ts
  • src/lib/verify-deployment.ts

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 0 still apply, 0 new items found

Consider writing more tests for
  • **Runtime validation** — Validate a rebuild-created session with `machine.state: "complete"`, `status: "in_progress"`, `resumable: true`, and a recorded `lastCompletedStep` is repaired before the record-only runner applies the next FSM transition.. Unit coverage is solid for the changed logic, but both production paths touch runtime/sandbox boundaries: rebuild-driven resume repair and deployment verification diagnostics.
  • **Runtime validation** — Verify `verifyDeployment` reports WhatsApp runtime/config missing warnings when `probeChannelRuntimeStatus` omits `whatsapp`, even though provider lookup is skipped for the tokenless channel.. Unit coverage is solid for the changed logic, but both production paths touch runtime/sandbox boundaries: rebuild-driven resume repair and deployment verification diagnostics.
  • **Runtime validation** — Clarify with a test whether an inconsistent completed session with `status: "complete"` but `resumable: true` should be repaired or rejected.. Unit coverage is solid for the changed logic, but both production paths touch runtime/sandbox boundaries: rebuild-driven resume repair and deployment verification diagnostics.
  • **Acceptance clause:** Trusted linked issue clauses for Track onboard FSM migration stack and resume compatibility #4533 / refactor(onboard): run live sequence with record-only steps #4472 — add test evidence or identify existing coverage. The deterministic GitHub context provided `linkedIssues: []`, so no trusted linked issue body/comment clauses were available to extract literally or verify. PR body references were treated as untrusted context only.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27116903665
Target ref: 9e7b8e44e24baabcd41397ada68ff4693db013ad
Workflow ref: main
Requested jobs: onboard-resume-e2e,rebuild-openclaw-e2e,messaging-providers-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
messaging-providers-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27117026620
Target ref: 9e7b8e44e24baabcd41397ada68ff4693db013ad
Workflow ref: main
Requested jobs: onboard-resume-e2e,onboard-repair-e2e,messaging-providers-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
messaging-providers-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27117531320
Target ref: 9e7b8e44e24baabcd41397ada68ff4693db013ad
Workflow ref: codex/fix-rebuild-resume-fsm
Requested jobs: rebuild-openclaw-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
rebuild-openclaw-e2e ✅ success

@cv cv merged commit f9526a2 into main Jun 8, 2026
106 checks passed
@cv cv deleted the codex/fix-rebuild-resume-fsm branch June 8, 2026 05:33
@cv cv added the v0.0.61 Release target label Jun 8, 2026
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression v0.0.61 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants