
refactor(onboard): run live sequence with record-only steps #4472

Merged
cv merged 30 commits into main from stack/onboard-fsm-live-record-only-sequence
Jun 8, 2026

Conversation

@cv

@cv cv commented May 28, 2026

Collaborator

Summary

Switch the live manual onboarding sequence to record-only step helpers while preserving compatibility for resume/ahead states. Machine transitions now come from returned FSM results rather than implicit step-helper movement.

Changes

  • Configure the live OnboardRuntimeBoundary with stepMutationOptions: { updateMachine: false }.
  • Explicitly enter preflight from init before invoking the first handler when applicable.
  • Apply ordered provider/inference FSM results through the compatibility boundary instead of only the final result.
  • Keep stale/ahead result no-op behavior for existing resume flows and tests.
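
The ordering and no-op rules in the list above can be sketched as follows. This is a minimal illustration, not the code from src/lib/onboard.ts: the state list is a simplified linear subset, and `FsmResult`, `MachineSnapshot`, `applyResult`, and `applyOrdered` are hypothetical names introduced here. It also assumes that "stale" means the machine has already moved past a result's source state and "ahead" means it has not reached that state yet.

```typescript
// Hypothetical sketch: these types and helpers are illustrative assumptions,
// not the actual NemoClaw onboarding API.
type FlowState = "init" | "preflight" | "gateway" | "inference" | "complete";

// Simplified linear ordering; the real FSM has more states and branches.
const FLOW_ORDER: readonly FlowState[] = ["init", "preflight", "gateway", "inference", "complete"];

interface FsmResult {
  from: FlowState;
  to: FlowState;
}

interface MachineSnapshot {
  state: FlowState;
  revision: number;
}

type ApplyOutcome = "applied" | "skipped-stale" | "skipped-ahead";

// Apply one returned result. Only a result whose source state matches the
// current machine state moves the machine; anything else is a no-op.
function applyResult(machine: MachineSnapshot, result: FsmResult): ApplyOutcome {
  const current = FLOW_ORDER.indexOf(machine.state);
  const source = FLOW_ORDER.indexOf(result.from);
  if (source < current) return "skipped-stale";
  if (source > current) return "skipped-ahead";
  machine.state = result.to;
  machine.revision += 1;
  return "applied";
}

// Apply every result in order instead of only the final one.
function applyOrdered(machine: MachineSnapshot, results: readonly FsmResult[]): ApplyOutcome[] {
  return results.map((result) => applyResult(machine, result));
}

// A session resumed at "gateway" replays the whole sequence; earlier results are skipped.
const resumed: MachineSnapshot = { state: "gateway", revision: 0 };
const outcomes = applyOrdered(resumed, [
  { from: "init", to: "preflight" },
  { from: "preflight", to: "gateway" },
  { from: "gateway", to: "inference" },
]);
console.log(outcomes.join(","), resumed.state, resumed.revision);
// prints "skipped-stale,skipped-stale,applied inference 1"
```

Because the helpers only record results and the boundary is the single place that moves the machine, replaying a full sequence against a resumed session is safe in this sketch: results that no longer match the current state change nothing.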

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • More reliable session resumption that restores onboarding progress and correctly handles previously failed steps.
    • Prevented incorrect advancement for nonterminal onboarding states during resume.
  • New Features

    • Automatic snapshot repair/normalization when resuming interrupted sessions for more consistent recovery.
    • Initial onboarding preflight state is now explicitly recorded to improve resume behavior.
  • Tests

    • Added tests validating resume/repair scenarios and successful boundary replay from failed steps.

cv added 21 commits May 27, 2026 15:18
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this May 28, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b7fddd6d-7dfc-4adc-9bfa-f3df611dbf53

📥 Commits

Reviewing files that changed from the base of the PR and between 5285792 and e0e3162.

📒 Files selected for processing (1)
  • src/lib/onboard/resume-machine-repair.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/resume-machine-repair.test.ts

📝 Walkthrough

Walkthrough

Adds resume-machine repair helpers and tests, integrates repair into onboarding resume initialization, configures the runtime boundary to avoid automatic machine updates, and explicitly records an initial preflight state.

Changes

Resume-Machine Repair and Onboarding Integration

| Layer / File(s) | Summary |
| --- | --- |
| Onboarding runtime wiring and configuration: src/lib/onboard.ts | Imports repairResumeMachineSnapshot and advanceTo, sets OnboardRuntimeBoundary stepMutationOptions: { updateMachine: false }, calls repairResumeMachineSnapshot during --resume session setup, and records an initial preflight state result via recordStateResult(advanceTo(..., { state: "init" })). Also includes minor import/whitespace formatting adjustments. |
| Resume-machine-repair test suite: src/lib/onboard/resume-machine-repair.test.ts, src/lib/onboard/resume-machine-repair.ts | Adds test helpers and a harness plus tests that validate resumeMachineState derivation and repairResumeMachineSnapshot behavior across failed, completed, and nonterminal snapshot states, and verifies record-only resume sequences complete and clear failures. |
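
To make the repair step in the walkthrough more concrete, here is a rough sketch of how a failed durable snapshot might be mapped back to an active state. It does not reproduce `repairResumeMachineSnapshot` or `resumeMachineState`. The snapshot shape, the step list, and the helpers `deriveResumeState` and `repairFailedSnapshot` are assumptions made for illustration, as are the rule that the resume state is the first step not yet recorded as complete and the revision bump on repair. Only the failed case is covered.

```typescript
// Hypothetical sketch: shapes and helper names are assumptions, not the real module.
type StepName = "preflight" | "gateway" | "inference";
type StepStatus = "pending" | "complete" | "failed";
type DurableState = StepName | "failed" | "complete";

interface DurableSnapshot {
  machine: { state: DurableState; revision: number };
  steps: Record<StepName, StepStatus>;
}

const STEP_ORDER: readonly StepName[] = ["preflight", "gateway", "inference"];

// Assumed rule: resume at the first step that is not recorded as complete.
function deriveResumeState(steps: Record<StepName, StepStatus>): StepName | "complete" {
  for (const step of STEP_ORDER) {
    if (steps[step] !== "complete") return step;
  }
  return "complete";
}

// Only terminal "failed" snapshots are rewritten here. Nonterminal and complete
// snapshots are returned unchanged so resume cannot advance them by accident.
function repairFailedSnapshot(snapshot: DurableSnapshot): DurableSnapshot {
  if (snapshot.machine.state !== "failed") return snapshot;
  return {
    machine: {
      state: deriveResumeState(snapshot.steps),
      revision: snapshot.machine.revision + 1,
    },
    steps: snapshot.steps,
  };
}

const failedAtGateway: DurableSnapshot = {
  machine: { state: "failed", revision: 4 },
  steps: { preflight: "complete", gateway: "failed", inference: "pending" },
};
const repaired = repairFailedSnapshot(failedAtGateway);
console.log(repaired.machine.state, repaired.machine.revision);
// prints "gateway 5"
```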

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#4457: Related use of stepMutationOptions/updateMachine: false to enable record-only boundary step behavior.
  • NVIDIA/NemoClaw#4376: Related changes to onboarding result/advance flow and runtime boundary interactions.

Suggested labels

refactor, onboarding, feature

Suggested reviewers

  • prekshivyas
  • cjagwani
  • ericksoa

Poem

🐇 I nudge the snapshot back to bright,
A tiny stitch in resume's night.
Preflight marked, the revision climbs,
Failures cleared in careful times.
Hop on—onward, step by step.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'refactor(onboard): run live sequence with record-only steps' directly aligns with the main changes in the PR, which involve switching the live onboarding sequence to use record-only step helpers and configuring the boundary with stepMutationOptions to prevent machine mutation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

github-actions Bot commented May 28, 2026

Contributor

E2E Advisor Recommendation

Required E2E: onboard-resume-e2e, onboard-repair-e2e
Optional E2E: cloud-e2e, ubuntu-repo-cloud-openclaw-resume

Dispatch hint: onboard-resume-e2e,onboard-repair-e2e

Auto-dispatched E2E: onboard-resume-e2e, onboard-repair-e2e via nightly-e2e.yaml at e0e31628676f9d36d9fb87d00c5a2dbb20f25a07 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • onboard-resume-e2e (medium; live Docker/OpenShell sandbox with NVIDIA API key, timeout 60 minutes): Focused existing E2E for interrupted onboard followed by nemoclaw onboard --resume --non-interactive; directly exercises the resume path modified by the PR.
  • onboard-repair-e2e (medium; live Docker/OpenShell sandbox with NVIDIA API key, timeout 60 minutes): Focused existing E2E for resume repair and invalidation behavior, including stale recorded sandbox recovery and resume conflict handling; relevant because the PR changes failed-session machine snapshot repair before resumed lifecycle steps run.

Optional E2E

  • cloud-e2e (medium; full live cloud onboarding path with NVIDIA API key): Useful broader confidence that the normal install → onboard → sandbox verify → live inference user journey still completes after the runtime-boundary and initial machine-transition changes in src/lib/onboard.ts. Not as targeted as the resume/repair jobs.
  • ubuntu-repo-cloud-openclaw-resume (medium; scenario runner with live NVIDIA API key): Typed scenario coverage for the OpenClaw cloud resume journey, if scenario E2E is available for this PR lane; complements the shell-based onboard resume E2E.

New E2E recommendations

  • onboarding_resume_state_machine (medium): Existing shell E2Es cover policy-step interruption and repair flows, but this PR specifically repairs durable machine.state === "failed" snapshots back to earlier active states such as preflight, gateway, or inference. Add a focused E2E that seeds or induces a failed durable FSM snapshot at an early onboarding state, runs nemoclaw onboard --resume, and asserts the live run resumes from the repaired state and completes without skipping required lifecycle work.
    • Suggested test: Add a focused resume-machine-snapshot-repair E2E for failed durable FSM snapshots at preflight/gateway/inference.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: onboard-resume-e2e,onboard-repair-e2e

@github-actions

github-actions Bot commented May 28, 2026

Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw-resume
Optional scenario E2E: ubuntu-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw-resume: Changes add and wire resume-machine snapshot repair into onboard --resume; this is directly exercised by the cloud OpenClaw resume-after-interrupt scenario.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume

Optional scenario E2E

  • ubuntu-repo-cloud-openclaw: Adjacent baseline onboarding coverage for the shared onboard.ts runtime-boundary/state-transition changes outside the explicit resume path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/resume-machine-repair.ts

@github-actions

github-actions Bot commented May 28, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 0 still apply, 0 new items found

Consider writing more tests for
  • **Runtime validation** — `--resume` from failed `agent_setup` for a non-OpenClaw agent repairs to `agent_setup`, skips stale earlier results, then applies `agent_setup -> policies -> finalizing -> post_verify -> complete`. The new unit/boundary tests cover the helper and important failed preflight/gateway/inference paths. Because the changed behavior lives in onboarding host glue where durable session repair, compatibility skipping, provider/inference result ordering, sandbox branching, policies, finalization, and exit-time failure backstops interact, targeted runtime or integration validation would further reduce regression risk.
  • **Runtime validation** — `--resume` from failed `policies` repairs to `policies`, revalidates or reapplies policy presets as needed, and only completes after finalization verification. The new unit/boundary tests cover the helper and important failed preflight/gateway/inference paths. Because the changed behavior lives in onboarding host glue where durable session repair, compatibility skipping, provider/inference result ordering, sandbox branching, policies, finalization, and exit-time failure backstops interact, targeted runtime or integration validation would further reduce regression risk.
  • **Runtime validation** — `--resume` after sandbox completion with an agent change clears agent-scoped state, does not reuse stale `openclaw` or `agent_setup` branch state, and restarts provider/sandbox alignment safely. The new unit/boundary tests cover the helper and important failed preflight/gateway/inference paths. Because the changed behavior lives in onboarding host glue where durable session repair, compatibility skipping, provider/inference result ordering, sandbox branching, policies, finalization, and exit-time failure backstops interact, targeted runtime or integration validation would further reduce regression risk.
  • **Runtime validation** — A higher-level onboard runtime harness starts from a failed durable session and verifies the real handler composition skips the initial `init -> preflight` result instead of throwing for repaired gateway, inference, agent, and policy states. The new unit/boundary tests cover the helper and important failed preflight/gateway/inference paths. Because the changed behavior lives in onboarding host glue where durable session repair, compatibility skipping, provider/inference result ordering, sandbox branching, policies, finalization, and exit-time failure backstops interact, targeted runtime or integration validation would further reduce regression risk.
  • **Runtime validation** — A non-OpenClaw `agent_setup` failure under record-only step mutation is persisted as terminal failed by the exit/failure backstop rather than remaining only as an in-progress session with a failed step. The new unit/boundary tests cover the helper and important failed preflight/gateway/inference paths. Because the changed behavior lives in onboarding host glue where durable session repair, compatibility skipping, provider/inference result ordering, sandbox branching, policies, finalization, and exit-time failure backstops interact, targeted runtime or integration validation would further reduce regression risk.
  • **Acceptance clause:** `npx prek run --all-files` passes — add test evidence or identify existing coverage. Claimed in the PR body, but this read-only advisory review did not execute commands or verify external run status.
  • **Acceptance clause:** `npm test` passes — add test evidence or identify existing coverage. Claimed in the PR body, but this read-only advisory review did not execute commands or verify external run status.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@cv cv added the onboarding label May 29, 2026
@wscurran wscurran added the area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow label Jun 3, 2026
@cv cv added the v0.0.61 Release target label Jun 5, 2026
cv added 2 commits June 5, 2026 15:42
Resolve PR #4471 against current main while preserving the ordered
provider/inference FSM result sequence.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Resolve PR #4472 against the updated provider result sequence base while
keeping live onboarding step mutations record-only.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Base automatically changed from stack/onboard-fsm-provider-result-sequence to main June 7, 2026 21:14
Bring PR #4472 current with origin/main.

Repair failed resume sessions.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Comment thread src/lib/onboard.ts Fixed
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv marked this pull request as ready for review June 7, 2026 21:40
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27105680547
Target ref: 11440c45d73c3973778a94d1b9122facb5c81a7b
Workflow ref: main
Requested jobs: onboard-resume-e2e,onboard-repair-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27105755137
Target ref: f5c0a22f044987e42d028e2e8bf50d82738a980b
Workflow ref: main
Requested jobs: onboard-resume-e2e,onboard-repair-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
onboard-repair-e2e ✅ success
onboard-resume-e2e ❌ failure

Failed jobs: onboard-resume-e2e. Check run artifacts for logs.

@cv

cv commented Jun 7, 2026

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27107451958
Target ref: e0e31628676f9d36d9fb87d00c5a2dbb20f25a07
Workflow ref: main
Requested jobs: onboard-resume-e2e,onboard-repair-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success

@cv

cv commented Jun 7, 2026

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cv cv merged commit 0a5327b into main Jun 8, 2026
28 checks passed
@cv cv deleted the stack/onboard-fsm-live-record-only-sequence branch June 8, 2026 00:54
cv added a commit that referenced this pull request Jun 8, 2026
<!-- markdownlint-disable MD041 -->
## Summary
Repairs reopened terminal onboarding machine snapshots before the
record-only resume path replays compatibility phases. It also makes
deployment messaging verification check sandbox-scoped bridge provider
names and skip tokenless WhatsApp provider checks, removing the
misleading missing-provider warning from rebuild logs.

## Related Issue
Refs #4533
Context: #4472

## Changes
- Repair `complete` machine snapshots that have been reopened for
resume/rebuild using legacy step state, while leaving non-resumable
completed sessions untouched.
- Add record-only resume regression coverage for reopened complete
snapshots.
- Verify messaging providers with sandbox-scoped bridge provider names
and skip provider checks for tokenless WhatsApp.
- Add deployment verification tests for channel-to-provider mapping and
tokenless channels.
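
As an illustration of the provider check described in the last two items, the sketch below skips tokenless channels and looks up a sandbox-scoped provider name for the rest. The commit message does not show the actual naming scheme, so the `<sandbox>-<channel>-bridge` format, the `ChannelSpec` shape, and both helper names are hypothetical. The Slack and Telegram channels are placeholders as well; only WhatsApp as a tokenless channel comes from the description above.

```typescript
// Hypothetical sketch: the naming scheme and types are assumptions for illustration.
interface ChannelSpec {
  channel: string;
  requiresToken: boolean;
}

// Assumed format for a sandbox-scoped bridge provider name.
function expectedProviderName(sandboxName: string, channel: string): string {
  return sandboxName + "-" + channel + "-bridge";
}

// Return the provider names that should exist for this sandbox but do not.
// Tokenless channels have no provider to verify, so they are skipped.
function missingProviders(
  sandboxName: string,
  channels: readonly ChannelSpec[],
  registered: readonly string[],
): string[] {
  const missing: string[] = [];
  for (const spec of channels) {
    if (!spec.requiresToken) continue;
    const providerName = expectedProviderName(sandboxName, spec.channel);
    if (registered.indexOf(providerName) === -1) missing.push(providerName);
  }
  return missing;
}

const demoChannels: ChannelSpec[] = [
  { channel: "slack", requiresToken: true },
  { channel: "telegram", requiresToken: true },
  { channel: "whatsapp", requiresToken: false },
];
console.log(missingProviders("demo", demoChannels, ["demo-slack-bridge"]).join(","));
// prints "demo-telegram-bridge"
```

In this sketch the tokenless WhatsApp entry never produces a missing-provider warning, which mirrors the behavior the change list describes.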

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved session recovery to repair certain completed workflow
snapshots while leaving non-resumable completed snapshots unchanged
* More accurate messaging provider verification that maps messaging
channels to expected gateway providers and skips tokenless channels
(e.g., WhatsApp)

* **Tests**
* Added coverage for session repair edge cases, including resumable vs
non-resumable complete snapshots and record-only resume scenarios
* Expanded messaging provider verification tests, including tokenless
channel handling
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added the refactor PR restructures code without intended behavior change label Jun 8, 2026

Labels

area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow refactor PR restructures code without intended behavior change v0.0.61 Release target

3 participants