Skip to content

refactor(onboard): extract provider selection resolver#5173

Merged
cv merged 9 commits into
mainfrom
codex/onboard-provider-selection-flow
Jun 11, 2026
Merged

refactor(onboard): extract provider selection resolver#5173
cv merged 9 commits into
mainfrom
codex/onboard-provider-selection-flow

Conversation

@cv

@cv cv commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Extracts non-interactive and resume provider selection out of setupNim into a focused resolver so the onboarding flow can keep shrinking without changing provider behavior. The resolver returns structured success/failure outcomes while setupNim preserves the existing user-facing error messages.

Related Issue

Refs #3802

Changes

  • Add src/lib/onboard/provider-selection.ts for requested-provider fallback, recorded-provider recovery, and provider-selection failure classification.
  • Add unit coverage for recovered providers, Windows-host Ollama edge cases, Hermes Provider gating, install-key fallback, and default cloud selection.
  • Replace the inline provider recovery block in setupNim with a resolver call plus existing message rendering.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • New Features

    • Enhanced provider selection with improved error recovery and targeted error messages for unavailable providers.
  • Bug Fixes

    • Better handling of provider availability conflicts and edge cases during onboarding.
  • Tests

    • Added comprehensive test coverage for provider selection resolver.

@cv cv added refactor PR restructures code without intended behavior change area: cli Command line interface, flags, terminal UX, or output area: architecture Architecture, design debt, major refactors, or maintainability labels Jun 10, 2026
@cv cv self-assigned this Jun 10, 2026
@cv cv added area: cli Command line interface, flags, terminal UX, or output area: architecture Architecture, design debt, major refactors, or maintainability labels Jun 10, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2da050c0-1705-44c4-b1bb-1e9039e99685

📥 Commits

Reviewing files that changed from the base of the PR and between 48bd12a and efc7ebe.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/provider-selection.test.ts
  • src/lib/onboard/provider-selection.ts

📝 Walkthrough

Walkthrough

This PR extracts provider-selection resolution logic into a new provider-selection module with structured success/failure types, comprehensive test coverage validating all resolution paths (recorded recovery, fallback, OS constraints, Hermes gating), and integrates the resolver into the onboarding inference flow to replace inline logic with targeted error messaging.

Changes

Provider Selection Resolver

Layer / File(s) Summary
Provider-selection resolver contract and implementation
src/lib/onboard/provider-selection.ts
Exports ProviderSelectionResolution, ProviderSelectionFailureReason, ProviderSelectionSuccess, ProviderSelectionRecoveryReaders, and ResolveRequestedProviderSelectionInput types. Implements resolveRequestedProviderSelection, which recovers recorded provider/model when no request is provided, validates Windows-host-Ollama support constraints for targeted requests, falls back via resolveProviderKeyFallback, and gates Hermes selection by availability. Returns structured failure or success with selected option and recovery metadata.
Provider-selection resolver test suite
src/lib/onboard/provider-selection.test.ts
Vitest suite with test fixture setup (option() helper, remoteProviderConfig) and resolve() wrapper. Cases validate: fallback from install-* actions, recorded-provider and model recovery, WSL+recorded-Ollama vs Windows-host-Ollama conflict detection, unavailable recorded Ollama with Windows-host hint messaging, Hermes unavailability when requested for another agent, unsupported Windows-host-Ollama request rejection, and default build selection when no provider is requested or recorded.
Onboarding provider-selection integration
src/lib/onboard.ts
Imports resolveRequestedProviderSelection and refactors inference provider selection. When non-interactive or with a pre-requested provider, delegates to the resolver; handles failure reason kinds with targeted console error/hints (e.g., Windows-host Ollama hints, Hermes-only availability messaging) then exits; on success, assigns selected and recovery flags (recoveredFromSandbox, recoveredModel) to seed downstream model/provider prompt defaults and labeling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#5171: Both PRs coordinate WSL/windows-host Ollama availability logic in the onboarding provider selection flow; this PR extracts selection into a resolver while the referenced PR extracts host-state detection utilities.

Suggested reviewers

  • prekshivyas

Poem

🐰 A resolver hops in, tests tucked neat,
Provider selection now complete,
Ollama on WSL, Windows too—
Recovery and fallback, tested through and through! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: extracting provider selection resolver logic into a dedicated module from the setupNim function.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/onboard-provider-selection-flow

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, onboard-negative-paths-e2e
Optional E2E: onboard-resume-e2e, gpu-e2e, wsl-e2e

Dispatch hint: cloud-onboard-e2e,onboard-negative-paths-e2e

Auto-dispatched E2E: cloud-onboard-e2e, onboard-negative-paths-e2e via nightly-e2e.yaml at efc7ebe3ac5cfa436fdcfd253f257003d1d341ebnightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium; live Docker sandbox plus NVIDIA endpoint access, default reusable-script timeout 45 minutes): Runs the real non-interactive install/onboard flow with NEMOCLAW_PROVIDER=cloud, creates a sandbox, and validates inference.local and sandbox health. This is the highest-signal existing E2E for changes in onboard provider selection and requested-provider handling.
  • onboard-negative-paths-e2e (medium; installs NemoClaw once and runs multiple onboard edge cases, timeout 75 minutes): Exercises onboard edge and negative paths, including non-interactive provider/model environment handling, invalid key rejection, gateway port conflict messaging, and NEMOCLAW_PROVIDER=cloud/NEMOCLAW_MODEL being honored. This directly guards the refactored failure branches and requested-provider unavailable behavior.

Optional E2E

  • onboard-resume-e2e (medium; live sandbox resume flow, timeout 60 minutes): Useful adjacent coverage because provider-selection now owns recorded provider/model recovery. This E2E validates interrupted onboard resume and completed provider-selection session state, but it may not directly exercise every recovered-provider branch added in this refactor.
  • gpu-e2e (high; requires GPU runner and local Ollama model pull, timeout 30 minutes): Optional confidence for local Ollama provider selection on a real GPU runner. The PR touches Ollama fallback and availability logic, but the mandatory cloud/negative tests cover the lower-cost critical requested-provider paths.
  • wsl-e2e (high; Windows runner with WSL setup and Docker-dependent full E2E, timeout 90 minutes): Optional platform-specific confidence for the WSL onboarding environment, relevant to the WSL recorded-Ollama versus Windows-host Ollama branch preserved by the refactor.

New E2E recommendations

  • WSL/Windows-host Ollama provider recovery (high): Existing E2E coverage does not appear to seed a sandbox recorded as ollama-local under WSL and then run onboard where only Windows-host Ollama is reachable, which is a key fail-loud branch in this refactor.
    • Suggested test: Add a hermetic or WSL-targeted onboard E2E that seeds recorded provider metadata as ollama-local, simulates Windows-host Ollama availability, and asserts the CLI exits with the explicit WSL Ollama hint instead of silently selecting Windows-host Ollama.
  • recorded provider/model recovery without explicit NEMOCLAW_PROVIDER (medium): The new helper centralizes recovery of recorded provider and model when no provider is requested, but existing live E2Es mostly pass NEMOCLAW_PROVIDER explicitly or validate resume state indirectly.
    • Suggested test: Add an E2E that onboards a sandbox with a non-default provider/model, clears NEMOCLAW_PROVIDER/NEMOCLAW_MODEL, re-runs non-interactive onboard against the same sandbox, and asserts the recorded provider/model are reused or fail loudly when unavailable.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,onboard-negative-paths-e2e

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: ubuntu-repo-cloud-openclaw
Optional Vitest E2E scenarios: ubuntu-repo-docker-post-reboot-recovery

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: The PR refactors non-interactive onboarding provider selection in src/lib/onboard.ts into src/lib/onboard/provider-selection.ts. The live-supported ubuntu-repo-cloud-openclaw scenario exercises the default CI onboarding path with NEMOCLAW_PROVIDER=cloud/NVIDIA endpoints and verifies the provider selection path still produces a working sandbox, inference, and credentials state.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional Vitest E2E scenarios

  • ubuntu-repo-docker-post-reboot-recovery: Optional adjacent coverage: this live-supported scenario uses the same cloud OpenClaw onboarding/provider-selection surface, then exercises the post-reboot recovery lifecycle. It is useful if reviewers want an extra guard that the refactor does not affect onboarded sandbox state before recovery, but it is not the smallest primary target for the provider-selection change.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-docker-post-reboot-recovery

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/provider-selection.ts

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 1 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: src/lib/onboard/provider-selection.ts fallback/recovery resolver: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: provider-selection.ts imports provider recovery and provider-key fallback helpers and returns structured selected/failure outcomes; setupNim consumes those outcomes for message rendering and process exit.

🌱 Nice ideas

  • Add runtime validation for provider-selection resume/fallback paths (src/lib/onboard.ts:3804): The new pure resolver is well covered, but the changed caller still depends on runtime menu construction, environment-derived provider requests, recorded registry/session/live-gateway state, WSL/Windows host detection, abortNonInteractive/process-exit behavior, and user-facing message rendering.
    • Recommendation: Add or identify targeted runtime/integration validation for representative non-interactive and resume paths, especially install-key fallback, recorded provider/model recovery, WSL recorded Ollama fail-closed behavior, Hermes Provider gating, and unsupported Windows-host Ollama rejection.
    • Evidence: The diff replaces the inline setupNim non-interactive provider-selection branch with resolveRequestedProviderSelection(...) in src/lib/onboard.ts, while src/lib/onboard/provider-selection.test.ts covers the pure resolver only.
Consider writing more tests for
  • **Runtime validation** — Validate non-interactive onboard with NEMOCLAW_PROVIDER=install-ollama selects running ollama when the dynamic menu no longer exposes install-ollama.. Unit coverage for resolveRequestedProviderSelection is strong, but the changed behavior crosses setupNim host-glue boundaries: dynamic provider menu construction, environment normalization, registry/session/live-gateway readers, WSL/Windows detection, abortNonInteractive/process.exit, and user-facing message rendering.
  • **Runtime validation** — Validate resume/non-interactive onboard with recorded openai-api recovers the openai option and recorded model through setupNim.. Unit coverage for resolveRequestedProviderSelection is strong, but the changed behavior crosses setupNim host-glue boundaries: dynamic provider menu construction, environment normalization, registry/session/live-gateway readers, WSL/Windows detection, abortNonInteractive/process.exit, and user-facing message rendering.
  • **Runtime validation** — Validate WSL resume with recorded ollama-local and reachable Windows-host Ollama exits with the explicit override hint instead of selecting Windows-host ollama.. Unit coverage for resolveRequestedProviderSelection is strong, but the changed behavior crosses setupNim host-glue boundaries: dynamic provider menu construction, environment normalization, registry/session/live-gateway readers, WSL/Windows detection, abortNonInteractive/process.exit, and user-facing message rendering.
  • **Runtime validation** — Validate a non-Hermes agent with NEMOCLAW_PROVIDER=hermesProvider exits with the Hermes-agent gating message.. Unit coverage for resolveRequestedProviderSelection is strong, but the changed behavior crosses setupNim host-glue boundaries: dynamic provider menu construction, environment normalization, registry/session/live-gateway readers, WSL/Windows detection, abortNonInteractive/process.exit, and user-facing message rendering.
  • **Runtime validation** — Validate unsupported Windows-host Ollama requests exit before compatible fallback to ollama.. Unit coverage for resolveRequestedProviderSelection is strong, but the changed behavior crosses setupNim host-glue boundaries: dynamic provider menu construction, environment normalization, registry/session/live-gateway readers, WSL/Windows detection, abortNonInteractive/process.exit, and user-facing message rendering.
  • **Add runtime validation for provider-selection resume/fallback paths** — Add or identify targeted runtime/integration validation for representative non-interactive and resume paths, especially install-key fallback, recorded provider/model recovery, WSL recorded Ollama fail-closed behavior, Hermes Provider gating, and unsupported Windows-host Ollama rejection.
  • **Acceptance clause:** Refs Umbrella: refactor onboarding into a serializable FSM #3802 — add test evidence or identify existing coverage. The PR body references Umbrella: refactor onboarding into a serializable FSM #3802, but deterministic context reports linkedIssues: [] and did not provide issue body or comments to extract literal acceptance clauses from.
  • **Acceptance clause:** Refactor provider selection without changing provider behavior — add test evidence or identify existing coverage. The resolver preserves the observed branch structure in src/lib/onboard.ts and has focused unit coverage, but setupNim runtime glue is not directly exercised by the new tests.
Since last review details

Current findings:

  • Source-of-truth review needed: src/lib/onboard/provider-selection.ts fallback/recovery resolver: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: provider-selection.ts imports provider recovery and provider-key fallback helpers and returns structured selected/failure outcomes; setupNim consumes those outcomes for message rendering and process exit.
  • Add runtime validation for provider-selection resume/fallback paths (src/lib/onboard.ts:3804): The new pure resolver is well covered, but the changed caller still depends on runtime menu construction, environment-derived provider requests, recorded registry/session/live-gateway state, WSL/Windows host detection, abortNonInteractive/process-exit behavior, and user-facing message rendering.
    • Recommendation: Add or identify targeted runtime/integration validation for representative non-interactive and resume paths, especially install-key fallback, recorded provider/model recovery, WSL recorded Ollama fail-closed behavior, Hermes Provider gating, and unsupported Windows-host Ollama rejection.
    • Evidence: The diff replaces the inline setupNim non-interactive provider-selection branch with resolveRequestedProviderSelection(...) in src/lib/onboard.ts, while src/lib/onboard/provider-selection.test.ts covers the pure resolver only.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27303793577
Target ref: 7bffadf2530fbb0ff71f886bcb81ebf69c0cf827
Workflow ref: main
Requested jobs: onboard-negative-paths-e2e,cloud-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
onboard-negative-paths-e2e ✅ success

@cv

cv commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Validation update for 7bffadf2530fbb0ff71f886bcb81ebf69c0cf827:

  • Addressed the PR Advisor follow-up by removing the inherited unreachable ambiguous-recorded-vllm resolver branch; current providerNameToOptionKey() maps recorded vllm-local without a NIM container to vllm.
  • Standard PR CI is green: static checks, build/typecheck, installer integration, CLI shards + aggregate checks, plugin tests, test-e2e-ollama-proxy, commit lint, DCO, CodeQL, ShellCheck, macOS, and guardrails.
  • PR Review Advisor is clean for review-blocking items: 0 needs attention, 0 worth checking, 1 nice idea for additional runtime validation.
  • CodeRabbit skipped review because this is draft; its status context is green.
  • Local validation passed: npm run build:cli, npm run typecheck:cli, Biome lint on touched files, targeted Vitest coverage including test/onboard.test.ts, and git diff --check.
  • Required live validation passed:

Remaining advisory note is optional/future coverage for deeper provider-selection runtime cases such as recorded-provider recovery and WSL Ollama host-boundary regression; the required advisor-selected runtime checks for this draft are green.

@cv

cv commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Next draft stack slice planned on top of this PR: extract the interactive provider prompt/menu rendering from setupNim.

Proposed boundary:

  • Move the interactive branch that prints detected local inference suggestions, renders provider menu labels, resolves the NEMOCLAW_PROVIDER default hint, prompts for a numbered choice, and delegates to selectFromNumberedMenuOrExit into a small helper under src/lib/onboard/.
  • Keep non-interactive/resume provider resolution in provider-selection.ts from this PR.
  • Keep provider-specific credential/model setup in setupNim for later slices.

This should be behavior-preserving and small enough for a draft PR: it only makes the prompt/default-selection contract directly testable and trims the next chunk of presentation logic out of setupNim.

@cv

cv commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Next draft stack slice opened: #5175 (refactor(onboard): extract provider prompt selection).

What moved:

  • Added promptForInferenceProviderSelection in src/lib/onboard/provider-selection-prompt.ts.
  • Moved the interactive provider-prompt branch out of setupNim: detected vLLM/Ollama suggestions, provider menu rendering, NEMOCLAW_PROVIDER default hint handling, prompt text, and delegation to selectFromNumberedMenuOrExit.
  • Kept non-interactive/resume provider resolution in refactor(onboard): extract provider selection resolver #5173 and provider-specific setup in setupNim.

Coverage added:

  • src/lib/onboard/provider-selection-prompt.test.ts covers menu rendering/default build provider, vLLM+Ollama suggestion text, matching env-provider default selection, and fallback to build when the env hint is unavailable.

Local validation on head 5ed50a80a:

  • npm run build:cli passed.
  • npm run typecheck:cli passed.
  • Biome format/lint on touched files passed.
  • Targeted Vitest passed: provider-selection-prompt, provider-selection, provider-menu, provider-host-state, and test/onboard.test.ts.
  • git diff --check passed.

Known local environment note remains: full npx prek run --all-files / full npm test are not claimed because of the unrelated local release-tag signing fixture issue; this stack still does not touch test/release-latest-tag.test.ts.

cv added 3 commits June 10, 2026 23:48
…nu-merge-update

# Conflicts:
#	test/e2e/test-onboard-inference-smoke.sh
…' into codex/tmp-provider-host-state-merge-update
…e-flow' into codex/tmp-provider-selection-merge-update
@cv cv added the area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow label Jun 11, 2026
Base automatically changed from codex/onboard-provider-host-state-flow to main June 11, 2026 08:27
# Conflicts:
#	src/lib/onboard.ts
#	src/lib/onboard/provider-host-state.test.ts
#	src/lib/onboard/provider-host-state.ts
@cv cv marked this pull request as ready for review June 11, 2026 08:37
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27334817467
Target ref: efc7ebe3ac5cfa436fdcfd253f257003d1d341eb
Workflow ref: main
Requested jobs: cloud-onboard-e2e,onboard-negative-paths-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
onboard-negative-paths-e2e ✅ success

@cv cv merged commit 463d59f into main Jun 11, 2026
37 of 39 checks passed
@cv cv deleted the codex/onboard-provider-selection-flow branch June 11, 2026 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: architecture Architecture, design debt, major refactors, or maintainability area: cli Command line interface, flags, terminal UX, or output area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow refactor PR restructures code without intended behavior change v0.0.64 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants