fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight by zyang-dev · Pull Request #4733 · NVIDIA/NemoClaw

zyang-dev · 2026-06-03T23:06:24Z

Summary

--gpus all resolves GPUs via the CDI spec, so a stale /etc/cdi/nvidia.yaml (e.g. an nvidia-uvm device major that drifted after a driver change) silently injects the wrong device and breaks CUDA in GPU containers (nvidia-smi works, cuInit fails). This adds host-preflight detection for a missing, stale, or unrefreshed NVIDIA CDI spec and guides the user to regenerate it / enable the nvidia-cdi-refresh units.

Related Issue

Fixes #4658

Changes

Check only the highest-precedence effective nvidia.com/gpu CDI spec when comparing declared /dev GPU device nodes with live device numbers.
Ignore stale /etc/cdi/nvidia.yaml leftovers when a fresh /var/run/cdi/nvidia.yaml overrides them.
Default omitted CDI device-node minor values to 0 and prefer hostPath for host-side stat checks.
Keep missing-spec remediation unchanged, including direct nvidia-ctk cdi generate fallback.
Split stale-spec remediation to use nvidia-cdi-refresh service refresh commands only, with optional manual leftover removal in displayed guidance.
Keep NVIDIA CDI refresh-unit health as a non-blocking warning unless the spec is missing or proven stale.
Move CDI preflight coverage into src/lib/onboard/preflight-cdi.test.ts and add regression coverage for precedence, stale/matching/absent devices, omitted-minor nodes, hostPath, refresh warnings, and stale auto-fix behavior.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Reproduced #4658 and manually verified that this PR fixes #4658

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

New Features
- Detects stale NVIDIA CDI GPU device specs and attempts automatic refresh via the NVIDIA CDI refresh service during host preflight; provides manual guidance for non-systemd hosts.
Bug Fixes
- Treats stale CDI GPU specs as blocking when CDI injection is configured.
- Installer repair now prefers enabling/starting the NVIDIA refresh service (with elevation) and updates messaging to say specs may be "missing or stale" and are being refreshed.
Tests
- Added/updated tests covering CDI detection, stale vs missing remediation plans, and installer repair behaviors.

…eflight Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

…oid false-positive blocks Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai · 2026-06-03T23:06:39Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Detects and flags missing or stale NVIDIA CDI GPU specs, compares declared vs live device nodes, probes nvidia-cdi-refresh unit health, emits generate/refresh/warn remediation actions, performs installer-level refresh via systemd (with sudo when needed), and adds dedicated tests and harness updates.

Changes

CDI Stale Spec Detection and Refresh Remediation

Layer / File(s)	Summary
docker-cdi module: types, parsing, detection, and command builders `src/lib/onboard/docker-cdi.ts`, `src/lib/onboard/docker-cdi.test.ts`	Adds CDI-focused public types, spec-dir parsing, effective-spec selection, device-node collection, mismatch detection, and command/explanation builders for repair/refresh/warn; includes parsing, staleness detection, and remediation command tests.
Preflight assessment and remediation planner `src/lib/onboard/preflight.ts`	Extends `HostAssessment` with CDI fields, probes `nvidia-cdi-refresh` systemd state, computes stale/mismatch/needsRepair flags, and updates `planHostRemediation` to emit generate/refresh/warn actions with toolkit/WSL/systemctl branching.
onboard: runtime assertion update `src/lib/onboard.ts`	`assertCdiNvidiaGpuSpecPresent` treats missing OR stale CDI specs as fatal when CDI injection is configured and updates the fatal message text.
Installer-level stale-spec repair and messaging `scripts/install.sh`	Adds `repair_installer_stale_nvidia_cdi_spec(flagged_file)` to enable/start `nvidia-cdi-refresh` (uses sudo when needed); `repair_installer_nvidia_cdi_spec` now distinguishes `missing` vs `stale` and delegates; CLI messaging updated to reference “missing or stale” / “Refreshing NVIDIA CDI device spec”.
Tests, harness updates, and test reorg `src/lib/onboard/preflight-cdi.test.ts`, `src/lib/onboard/docker-cdi.test.ts`, `src/lib/onboard/preflight.test.ts`, `test/install-preflight.test.ts`	Adds dedicated CDI preflight and docker-cdi test suites; removes moved CDI blocks from `preflight.test.ts`; extends install-preflight harness to simulate `stale` state, pass `CDI_STALE_FILE`, and verify refresh-only repair behavior; updates assertions.

Sequence Diagram

sequenceDiagram
  participant CLI as assessHost
  participant DockerCdi as docker-cdi.ts
  participant Systemctl as systemctl
  participant Planner as planHostRemediation
  participant Installer as installer (scripts/install.sh)

  CLI->>DockerCdi: discover effective CDI spec / collect device nodes
  DockerCdi->>CLI: report mismatch / stale
  CLI->>Systemctl: probe nvidia-cdi-refresh (is-enabled/is-failed)
  CLI->>Planner: compute remediation (missing | stale | warn)
  Planner->>Installer: emit repair plan (generate / refresh / warn)
  Installer->>Systemctl: enable/start nvidia-cdi-refresh (sudo if needed)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/NemoClaw#4298: Modifies CDI remediation/validation decision points and WSL+Docker Desktop exemptions; related to CDI handling and installer repair flow.

Suggested labels

fix, Docker, bug-fix, Platform: Ubuntu

Suggested reviewers

cv

"I sniffed the stale CDI trace,
I nudged systemd to start the race,
I checked majors, minors, and specs,
I hopped to enable refresh checks,
Now GPUs wake with sunny grace!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main change: detecting stale or unrefreshed NVIDIA CDI specs in host preflight, which aligns with the core objective of adding staleness detection to prevent incorrect device node injection.
Linked Issues check	✅ Passed	The PR implements the primary coding objective from `#4658`: detecting and handling stale NVIDIA CDI specs that cause incorrect device node injection and CUDA failures in containers. All major requirements are met.
Out of Scope Changes check	✅ Passed	All changes are focused on NVIDIA CDI spec detection, staleness validation, and remediation. No out-of-scope modifications were introduced; changes remain targeted to the CDI preflight assessment and repair workflow.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/cdi-refresh-preflight

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

github-actions · 2026-06-03T23:10:31Z

PR Review Advisor

Findings: 1 needs attention, 6 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 4 still apply, 1 new item found

Review findings

🛠️ Needs attention

Add targeted runtime validation for the linked DGX Spark vLLM path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit and installer-preflight coverage for CDI detection and repair, but the changed installer/onboarding/sandbox infrastructure does not include targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
- Recommendation: Add or identify targeted runtime/integration validation for the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
- Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.

🔎 Worth checking

Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and `sudo nvidia-ctk cdi generate --output=...` implement localized recovery/fallback behavior.
Source-of-truth review needed: Stale CDI mismatch path transport: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` and the installer node snippet both split at whitespace.
Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:274): The displayed remediation command builders shell-quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. CDI spec directories are host filesystem paths, so a path containing spaces is truncated before it reaches remediation output or installer stale repair messaging.
- Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches both TypeScript remediation output and the installer stale-repair path.
- Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe in `scripts/install.sh` uses `mismatch.trim().split(/\s+/, 1)[0]`. Existing builder tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
Avoid lossy host-path parsing before privileged CDI repair guidance (src/lib/onboard/docker-cdi.ts:374): Host-derived CDI paths flow into sudo/manual remediation guidance. The commands are shell-quoted, so this is not a confirmed command-injection issue, but the whitespace-delimited mismatch string loses path integrity before privileged repair guidance is generated.
- Recommendation: Carry the stale CDI file path as a structured field through the assessment and installer repair probe, then keep shell-quoting at the final command-rendering boundary. Add a negative test with spaces and shell metacharacters in the effective stale spec path.
- Evidence: `buildStaleCdiWarnCommands()` and `buildStaleCdiManualWarnCommands()` quote their `flaggedFilePath`, but `flaggedFilePath` is produced by `extractCdiMismatchFilePath()` from a whitespace-delimited string. The installer repeats the same split before calling `repair_installer_stale_nvidia_cdi_spec`.
Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:391): The PR adds localized recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now mostly clear, but the code still does not state when this host workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
- Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
- Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct `nvidia-ctk cdi generate` fallback guidance. Tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.
Offset new CDI helper monolith growth (src/lib/onboard/docker-cdi.ts:1): The PR successfully moves CDI logic out of the already-large preflight file, but the new `docker-cdi.ts` helper grew into a 506-line module in one patch. This creates a new onboarding host-glue hotspot in a security-sensitive area that handles host filesystem parsing and privileged repair guidance.
- Recommendation: Consider splitting parsing, host assessment, and remediation-command rendering into smaller modules or otherwise offset the growth before this becomes another hard-to-review monolith.
- Evidence: Deterministic monolith evidence: `src/lib/onboard/docker-cdi.ts` baseLines 57, headLines 506, delta +449, severity blocker rationale: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."

🌱 Nice ideas

None.

Since last review details

Current findings:

Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and `sudo nvidia-ctk cdi generate --output=...` implement localized recovery/fallback behavior.
Source-of-truth review needed: Stale CDI mismatch path transport: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` and the installer node snippet both split at whitespace.
Add targeted runtime validation for the linked DGX Spark vLLM path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit and installer-preflight coverage for CDI detection and repair, but the changed installer/onboarding/sandbox infrastructure does not include targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
- Recommendation: Add or identify targeted runtime/integration validation for the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
- Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.
Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:274): The displayed remediation command builders shell-quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. CDI spec directories are host filesystem paths, so a path containing spaces is truncated before it reaches remediation output or installer stale repair messaging.
- Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches both TypeScript remediation output and the installer stale-repair path.
- Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe in `scripts/install.sh` uses `mismatch.trim().split(/\s+/, 1)[0]`. Existing builder tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
Avoid lossy host-path parsing before privileged CDI repair guidance (src/lib/onboard/docker-cdi.ts:374): Host-derived CDI paths flow into sudo/manual remediation guidance. The commands are shell-quoted, so this is not a confirmed command-injection issue, but the whitespace-delimited mismatch string loses path integrity before privileged repair guidance is generated.
- Recommendation: Carry the stale CDI file path as a structured field through the assessment and installer repair probe, then keep shell-quoting at the final command-rendering boundary. Add a negative test with spaces and shell metacharacters in the effective stale spec path.
- Evidence: `buildStaleCdiWarnCommands()` and `buildStaleCdiManualWarnCommands()` quote their `flaggedFilePath`, but `flaggedFilePath` is produced by `extractCdiMismatchFilePath()` from a whitespace-delimited string. The installer repeats the same split before calling `repair_installer_stale_nvidia_cdi_spec`.
Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:391): The PR adds localized recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now mostly clear, but the code still does not state when this host workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
- Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
- Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct `nvidia-ctk cdi generate` fallback guidance. Tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.
Offset new CDI helper monolith growth (src/lib/onboard/docker-cdi.ts:1): The PR successfully moves CDI logic out of the already-large preflight file, but the new `docker-cdi.ts` helper grew into a 506-line module in one patch. This creates a new onboarding host-glue hotspot in a security-sensitive area that handles host filesystem parsing and privileged repair guidance.
- Recommendation: Consider splitting parsing, host assessment, and remediation-command rendering into smaller modules or otherwise offset the growth before this becomes another hard-to-review monolith.
- Evidence: Deterministic monolith evidence: `src/lib/onboard/docker-cdi.ts` baseLines 57, headLines 506, delta +449, severity blocker rationale: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

github-actions · 2026-06-03T23:10:41Z

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, gpu-e2e
Optional E2E: onboard-negative-paths-e2e, gpu-double-onboard-e2e

Dispatch hint: cloud-onboard-e2e,gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

cloud-onboard-e2e (medium): Covers the installer plus non-interactive onboarding path on a clean Linux Docker environment, which is directly affected by install.sh and preflight/onboard changes even when the host is not a GPU/CDI machine.
gpu-e2e (high): Required because the PR changes NVIDIA GPU CDI detection/repair and the onboarding guard for OpenShell GPU startup. This job runs the real GPU/Ollama user flow on a GPU runner and is the closest existing E2E coverage for Docker CDI GPU passthrough into the sandbox.

Optional E2E

onboard-negative-paths-e2e (medium): Useful adjacent coverage for non-interactive onboarding validation and friendly failure paths after preflight error handling changed, but it does not specifically synthesize missing/stale NVIDIA CDI specs.
gpu-double-onboard-e2e (high): Optional extra confidence that repeated GPU onboarding still works after CDI repair/preflight changes; higher cost and not specifically targeted at stale CDI repair.

New E2E recommendations

GPU Docker CDI stale-spec repair (high): Existing E2E jobs exercise a normal GPU environment but do not deliberately create a stale effective nvidia.com/gpu CDI spec or unhealthy nvidia-cdi-refresh units, so they may miss regressions in the new stale-spec repair path and messaging.
- Suggested test: Add a GPU-runner E2E that snapshots/restores host CDI state, injects a stale /etc/cdi or /var/run/cdi nvidia.yaml device-node mismatch, runs install/onboard preflight, verifies the refresh-service repair path is attempted, and confirms onboarding succeeds or blocks with actionable remediation.
installer CDI missing-spec repair (medium): Unit tests cover missing-spec command construction, but no existing E2E validates installer behavior when Docker advertises CDISpecDirs and nvidia.com/gpu is absent on a real host.
- Suggested test: Add an installer preflight E2E fixture on a GPU/CDI-capable runner that temporarily hides the NVIDIA CDI spec, runs scripts/install.sh --non-interactive, and asserts the installer invokes the refresh/generate path and re-assesses the effective spec before onboarding.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: cloud-onboard-e2e,gpu-e2e

github-actions · 2026-06-03T23:13:03Z

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw
Optional scenario E2E: ubuntu-no-docker-preflight-negative, wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

ubuntu-repo-cloud-openclaw: Core installer/onboarding preflight code changed. The baseline Ubuntu repo-install OpenClaw scenario is the smallest standard scenario that exercises the repo-current install path plus onboarding preflight and sandbox startup on a normal Docker host.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
gpu-repo-local-ollama-openclaw: The PR changes NVIDIA Docker CDI detection, stale/missing CDI spec remediation, and GPU preflight blocking behavior. This is the only dispatchable scenario routed through the gpu-docker-cdi runtime and therefore the only scenario path that directly exercises the changed CDI/GPU onboarding surface.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

ubuntu-no-docker-preflight-negative: Optional adjacent coverage for preflight failure handling. The main changes are CDI/GPU-specific, but preflight assessment/remediation code was refactored, so the no-Docker negative scenario can catch regressions in early preflight failure classification without running the full sandbox path.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-no-docker-preflight-negative
wsl-repo-cloud-openclaw: Optional platform-adjacent coverage because the changed CDI preflight paths explicitly exempt WSL Docker Desktop runtime behavior. This uses a special Windows/WSL runner, so it is not primary unless WSL behavior is a release concern.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

scripts/install.sh
src/lib/onboard.ts
src/lib/onboard/docker-cdi.ts
src/lib/onboard/preflight.ts

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/install.sh`:
- Around line 1903-1912: The stale-spec auto-repair branch currently runs when
host.cdiNvidiaGpuSpecStale && host.cdiNvidiaGpuSpecNeedsRepair &&
!host.cdiNvidiaGpuSpecMissing && !isWslDockerDesktopRuntime(host), which can
prompt sudo even when nvidia-container-toolkit is absent; add a requirement that
host.nvidiaContainerToolkitInstalled is true to that condition (i.e., gate the
stale branch on host.nvidiaContainerToolkitInstalled) so the installer defers to
the later "install toolkit first" remediation instead of attempting auto-repair.

In `@src/lib/onboard.ts`:
- Around line 1844-1849: The guard that checks host.cdiNvidiaGpuSpecNeedsRepair,
host.cdiNvidiaGpuSpecMissing and optedOutGpuPassthrough is verbose and inflates
the file; collapse it into a single compact conditional and keep behavior
identical (early return when not needing repair or when opted out). Then
compress the subsequent error logging/throwing block (the code that runs when
the check fails) by formatting the error message as a single-line template
string or using a short helper to build the message so you remove extra blank
lines and multi-line concatenation; update references to the same symbols
(host.cdiNvidiaGpuSpecNeedsRepair, host.cdiNvidiaGpuSpecMissing,
optedOutGpuPassthrough) and to the function that emits the error so behavior
remains unchanged.

In `@src/lib/onboard/preflight.ts`:
- Around line 1189-1191: The stale-spec branch must mirror the missing-spec
branch by checking assessment.systemctlAvailable before producing systemd-only
commands: change the repairCommands assignment so when missingSpec is false it
uses buildStaleCdiWarnCommands(flaggedFilePath) only if
assessment.systemctlAvailable is true, otherwise call a new or existing
non-systemd fallback (e.g., buildStaleCdiManualWarnCommands or return a
non-blocking warning action) that emits manual remediation steps instead of
systemctl commands; update references to repairCommands,
buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
assessment.systemctlAvailable, specPath and flaggedFilePath accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f820c13a-4139-4ddd-be6a-1414d00efb93

📥 Commits

Reviewing files that changed from the base of the PR and between 11fcbf7 and 366f996.

📒 Files selected for processing (6)

scripts/install.sh
src/lib/onboard.ts
src/lib/onboard/preflight-cdi.test.ts
src/lib/onboard/preflight.test.ts
src/lib/onboard/preflight.ts
test/install-preflight.test.ts

💤 Files with no reviewable changes (1)

src/lib/onboard/preflight.test.ts

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard/preflight-cdi.test.ts (1)
51-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make these assessHost cases fully hermetic.

These tests don't inject runCaptureImpl, so assessHost() still shells out for package-manager detection. That makes the suite depend on whatever tools are installed on the runner instead of just the fixture data. Add a shared no-op/stub runCaptureImpl for these cases so the unit tests stop touching the host environment. As per coding guidelines, **/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs in unit tests.

Also applies to: 71-86, 222-240
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight-cdi.test.ts` around lines 51 - 64, The tests call
assessHost but don't inject runCaptureImpl, so package-manager detection still
shells out; add a hermetic stub by passing runCaptureImpl that returns a
deterministic capture object (e.g., { stdout: "", stderr: "", code: 0 } or
similar) to each assessHost invocation in this test file; update the assessHost
calls (the ones in this test and the other similar blocks) to include
runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 }) so the unit tests no
longer touch the real host or NVIDIA tools.

♻️ Duplicate comments (1)

src/lib/onboard/preflight.ts (1)
894-896: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep stale-spec remediation runnable without systemd.

When systemctlAvailable === false, the stale-spec path still emits only systemctl commands and then marks the action blocking. That leaves non-systemd Linux hosts with no executable repair path, even though the missing-spec path already has a fallback. Gate this branch the same way or downgrade it to a non-blocking/manual warning.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight.ts` around lines 894 - 896, The stale-spec branch
currently always generates systemctl-only remediation via
buildStaleCdiWarnCommands and can be blocking on non-systemd hosts; modify the
logic that sets repairCommands so it checks systemctlAvailable the same way the
missingSpec path does (or make the stale-spec path non-blocking/manual when
systemctlAvailable === false). Specifically, when choosing between
buildNvidiaCdiRepairCommands and buildStaleCdiWarnCommands, gate the
buildStaleCdiWarnCommands call on systemctlAvailable (or replace it with a
non-systemd fallback/no-op warning and clear the blocking flag) using the
existing symbols repairCommands, systemctlAvailable, buildStaleCdiWarnCommands,
buildNvidiaCdiRepairCommands, missingSpec, and flaggedFilePath to locate and
implement the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 257-272: The shell commands generated by
buildNvidiaCdiRepairCommands interpolate specDir and specPath raw into strings;
run specDir and specPath through the repo's shell-quoting helper before building
each command string so spaces or metacharacters are safely quoted, and do the
same for flaggedFilePath in the other remediation block referenced around lines
300-305—i.e., locate where flaggedFilePath is interpolated and replace direct
interpolation with the quoted value from the shell-quoting helper.

---

Outside diff comments:
In `@src/lib/onboard/preflight-cdi.test.ts`:
- Around line 51-64: The tests call assessHost but don't inject runCaptureImpl,
so package-manager detection still shells out; add a hermetic stub by passing
runCaptureImpl that returns a deterministic capture object (e.g., { stdout: "",
stderr: "", code: 0 } or similar) to each assessHost invocation in this test
file; update the assessHost calls (the ones in this test and the other similar
blocks) to include runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 })
so the unit tests no longer touch the real host or NVIDIA tools.

---

Duplicate comments:
In `@src/lib/onboard/preflight.ts`:
- Around line 894-896: The stale-spec branch currently always generates
systemctl-only remediation via buildStaleCdiWarnCommands and can be blocking on
non-systemd hosts; modify the logic that sets repairCommands so it checks
systemctlAvailable the same way the missingSpec path does (or make the
stale-spec path non-blocking/manual when systemctlAvailable === false).
Specifically, when choosing between buildNvidiaCdiRepairCommands and
buildStaleCdiWarnCommands, gate the buildStaleCdiWarnCommands call on
systemctlAvailable (or replace it with a non-systemd fallback/no-op warning and
clear the blocking flag) using the existing symbols repairCommands,
systemctlAvailable, buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
missingSpec, and flaggedFilePath to locate and implement the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a42d9078-1396-486d-a757-485dcbccdbbf

📥 Commits

Reviewing files that changed from the base of the PR and between 9f74eef and 34c7b5f.

📒 Files selected for processing (4)

src/lib/onboard/docker-cdi.test.ts
src/lib/onboard/docker-cdi.ts
src/lib/onboard/preflight-cdi.test.ts
src/lib/onboard/preflight.ts

✅ Files skipped from review due to trivial changes (1)

src/lib/onboard/docker-cdi.test.ts

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard/docker-cdi.ts (1)
337-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align user-facing explanation with remediation command (nemoclaw onboard) to avoid confusing OpenShell command references.
src/lib/onboard/docker-cdi.ts remediation tells users to rerun nemoclaw onboard (optionally --no-gpu), but the explanation strings attribute the failure to OpenShell’s gateway start --gpu, which is only an internal detail of the GPU passthrough path. Rewording to reference nemoclaw onboard GPU passthrough keeps the cause while matching the command users are actually told to run.
📝 Proposed wording alignment
-    "OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a stale " +
+    "`nemoclaw onboard` (GPU passthrough) injects devices from the CDI spec, so a stale " +
   reasons.push(
-    "OpenShell's `gateway start --gpu` can fail until the CDI spec is refreshed and verified.",
+    "`nemoclaw onboard` GPU passthrough can fail until the CDI spec is refreshed and verified.",
   );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-cdi.ts` around lines 337 - 379, Replace user-facing
references to "OpenShell's `gateway start --gpu`" with the user-facing
remediation command "nemoclaw onboard" (mentioning the optional --gpu flag where
appropriate) in the strings in this file: update the long explanatory return
string near the top of docker-cdi.ts and the final reasons.push in
explainNvidiaCdiRepairReason so they reference running "nemoclaw onboard" (or
"nemoclaw onboard --no-gpu"/"--gpu" as applicable) instead of the internal
OpenShell command; keep the rest of each message unchanged so the technical
detail and mitigation advice remain intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 337-379: Replace user-facing references to "OpenShell's `gateway
start --gpu`" with the user-facing remediation command "nemoclaw onboard"
(mentioning the optional --gpu flag where appropriate) in the strings in this
file: update the long explanatory return string near the top of docker-cdi.ts
and the final reasons.push in explainNvidiaCdiRepairReason so they reference
running "nemoclaw onboard" (or "nemoclaw onboard --no-gpu"/"--gpu" as
applicable) instead of the internal OpenShell command; keep the rest of each
message unchanged so the technical detail and mitigation advice remain intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05a9f650-d750-47e3-93a9-56686c4d79b3

📥 Commits

Reviewing files that changed from the base of the PR and between ead83f7 and 529f1de.

📒 Files selected for processing (2)

src/lib/onboard/docker-cdi.test.ts
src/lib/onboard/docker-cdi.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/docker-cdi.test.ts

…mediation command paths Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

prekshivyas

LGTM — approving. This is a careful fix for a nasty silent failure (stale CDI → nvidia-smi works but cuInit fails). Verified the core logic:

Precedence: findEffectiveNvidiaCdiSpec iterates [...specDirs].reverse() and returns the first match, so the highest-precedence dir wins — Docker's CDISpecDirs is [/etc/cdi, /var/run/cdi], so a fresh /var/run/cdi/nvidia.yaml correctly overrides a stale /etc/cdi leftover. Matches the nvidia-cdi-refresh model.
Fail-safe comparison: findCdiDeviceNodeMismatch continues when the live device can't be stat'd and only flags when a /dev/* node genuinely exists with different major/minor (stat -c %t %T parsed as hex). Omitted minor defaults to 0, hostPath preferred. So it won't false-positive on absent/non-GPU nodes.
Gating: detection only applies on linux + NVIDIA GPU + docker-reported CDI dirs, and excludes WSL+docker-desktop. Missing/proven-stale → blocking: true (right call — better to halt than onboard a broken GPU sandbox); refresh-unit health alone is a non-blocking warning.
Shell safety: remediation commands quote dynamic paths via shellQuote (quotedSpecDir/quotedSpecPath/quotedFlaggedFilePath).
install.sh's stale auto-repair branch is gated on nvidiaContainerToolkitInstalled in the assessment emitter, so it won't fire nvidia-ctk/refresh on a toolkit-less host.

Tests cover precedence, stale/matching/absent devices, omitted-minor, hostPath, refresh warnings, and stale auto-fix; CI is green.

Re CodeRabbit: its onboard.ts growth-guardrail "Major" is stale (codebase-growth-guardrails passes now), and the quote-paths finding is already handled by the shellQuote calls above — both can be disregarded. One courtesy double-check, non-blocking: confirm the systemctlAvailable === false stale path (CodeRabbit @ preflight.ts:1191) renders the manual-removal guidance you intend — the code does branch to buildStaleCdiManualWarnCommands, so it looks right; just worth an eyeball. I did not run the suite locally or exhaustively trace the manual path.

## Summary - Add the v0.0.59 release notes from the GitHub announcement discussion. - Refresh local inference and credential-storage guidance for the current release behavior. - Regenerate the user skills from the updated Fern docs. - Tighten release-prep and docs review guidance for generated skills, PR labels, and shared `$$nemoclaw` command placeholders. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" --glob '*.{md,mdx}'` - `git diff --check` - `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC permission failure) - `npm run typecheck:cli` - Pre-commit hooks during commit passed, including markdownlint, docs-to-skills verification, gitleaks, commitlint, and skills YAML tests. ## Source Summary - #3679, #4437, #4681, #4766, #4772, #4775, #4786 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`, `docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27 compatibility, runtime path pinning, plugin registry recovery, live gateway reconciliation, and clearer host-alias/startup diagnostics. - #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`, `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/inference/switch-inference-providers.mdx`: Document the release inference changes covering Local NIM waits, Hermes Anthropic routing, Nemotron 3 Ultra, the current Ollama starter fallback, and Spark managed-vLLM context length. - #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`, `docs/security/credential-storage.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/troubleshooting.mdx`: Capture permission healing, gateway-stored credential reuse, cross-sandbox messaging credential conflict checks, and CDI preflight diagnostics. - #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`: Regenerate the user skill references from the updated source docs. - Follow-up maintenance -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`, `.coderabbit.yaml`: Add release-prep area labels for docs and skills PRs, and teach docs review guidance that `$$nemoclaw` is the correct shared command placeholder for examples that work across agent aliases. Note: the `documentation` label was not present in the repository, so this PR is labeled with `v0.0.59` only.  ## Summary by CodeRabbit * **Documentation** * Updated default model for local Ollama inference setup to qwen3.5:9b * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option * Clarified credential storage and reuse behavior for post-deployment (day-two) operations * Added v0.0.59 release notes covering OpenClaw compatibility, inference options, Hermes messaging sync, and troubleshooting * Clarified CLI selection guidance and updated OpenClaw version example in status output * Revised release-prep instructions and docs review guidance for CLI alias usage

zyang-dev added 3 commits June 3, 2026 14:48

fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host pr…

445b616

…eflight Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

fix(onboard): evaluate CDI staleness on the effective spec only to av…

0df653a

…oid false-positive blocks Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

fix(install): remove invalid return from CDI repair probe

366f996

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

fix(onboard): keep CDI guard line-neutral in onboard entrypoint

9f74eef

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread scripts/install.sh

Comment thread src/lib/onboard.ts Outdated

Comment thread src/lib/onboard/preflight.ts Outdated

zyang-dev added 2 commits June 3, 2026 16:19

refactor(onboard): move CDI preflight helpers into docker-cdi module

34c7b5f

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

fix(onboard): guard stale CDI repair on toolkit and systemd availability

ead83f7

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

zyang-dev added the v0.0.58 Release target label Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/lib/onboard/docker-cdi.ts

fix(onboard): quote CDI remediation command paths

529f1de

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

prekshivyas self-assigned this Jun 3, 2026

cv added v0.0.59 Release target and removed v0.0.58 Release target labels Jun 4, 2026

zyang-dev added 2 commits June 3, 2026 19:44

Updated CDI preflight expectations to match the newly shell-quoted re…

1279b23

…mediation command paths Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

refactor(onboard): move CDI host assessment into docker helper

4163765

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

prekshivyas approved these changes Jun 4, 2026

View reviewed changes

cv merged commit 8ecaf0a into main Jun 4, 2026
28 checks passed

cv deleted the fix/cdi-refresh-preflight branch June 4, 2026 17:46

miyoungc mentioned this pull request Jun 5, 2026

docs: refresh 0.0.59 release notes #4790

Merged

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Conversation

zyang-dev commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

prekshivyas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zyang-dev commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

github-actions Bot commented Jun 3, 2026 •

edited

Loading

github-actions Bot commented Jun 3, 2026 •

edited

Loading

github-actions Bot commented Jun 3, 2026 •

edited

Loading