Skip to content

fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733

Merged
cv merged 9 commits into
mainfrom
fix/cdi-refresh-preflight
Jun 4, 2026
Merged

fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733
cv merged 9 commits into
mainfrom
fix/cdi-refresh-preflight

Conversation

@zyang-dev

@zyang-dev zyang-dev commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

--gpus all resolves GPUs via the CDI spec, so a stale /etc/cdi/nvidia.yaml (e.g. an nvidia-uvm device major that drifted after a driver change) silently injects the wrong device and breaks CUDA in GPU containers (nvidia-smi works, cuInit fails). This adds host-preflight detection for a missing, stale, or unrefreshed NVIDIA CDI spec and guides the user to regenerate it / enable the nvidia-cdi-refresh units.

Related Issue

Fixes #4658

Changes

  • Check only the highest-precedence effective nvidia.com/gpu CDI spec when comparing declared /dev GPU device nodes with live device numbers.
  • Ignore stale /etc/cdi/nvidia.yaml leftovers when a fresh /var/run/cdi/nvidia.yaml overrides them.
  • Default omitted CDI device-node minor values to 0 and prefer hostPath for host-side stat checks.
  • Keep missing-spec remediation unchanged, including direct nvidia-ctk cdi generate fallback.
  • Split stale-spec remediation to use nvidia-cdi-refresh service refresh commands only, with optional manual leftover removal in displayed guidance.
  • Keep NVIDIA CDI refresh-unit health as a non-blocking warning unless the spec is missing or proven stale.
  • Move CDI preflight coverage into src/lib/onboard/preflight-cdi.test.ts and add regression coverage for precedence, stale/matching/absent devices, omitted-minor nodes, hostPath, refresh warnings, and stale auto-fix behavior.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Reproduced #4658 and manually verified that this PR fixes #4658


Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • New Features

    • Detects stale NVIDIA CDI GPU device specs and attempts automatic refresh via the NVIDIA CDI refresh service during host preflight; provides manual guidance for non-systemd hosts.
  • Bug Fixes

    • Treats stale CDI GPU specs as blocking when CDI injection is configured.
    • Installer repair now prefers enabling/starting the NVIDIA refresh service (with elevation) and updates messaging to say specs may be "missing or stale" and are being refreshed.
  • Tests

    • Added/updated tests covering CDI detection, stale vs missing remediation plans, and installer repair behaviors.

zyang-dev added 3 commits June 3, 2026 14:48
…eflight

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
…oid false-positive blocks

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Detects and flags missing or stale NVIDIA CDI GPU specs, compares declared vs live device nodes, probes nvidia-cdi-refresh unit health, emits generate/refresh/warn remediation actions, performs installer-level refresh via systemd (with sudo when needed), and adds dedicated tests and harness updates.

Changes

CDI Stale Spec Detection and Refresh Remediation

Layer / File(s) Summary
docker-cdi module: types, parsing, detection, and command builders
src/lib/onboard/docker-cdi.ts, src/lib/onboard/docker-cdi.test.ts
Adds CDI-focused public types, spec-dir parsing, effective-spec selection, device-node collection, mismatch detection, and command/explanation builders for repair/refresh/warn; includes parsing, staleness detection, and remediation command tests.
Preflight assessment and remediation planner
src/lib/onboard/preflight.ts
Extends HostAssessment with CDI fields, probes nvidia-cdi-refresh systemd state, computes stale/mismatch/needsRepair flags, and updates planHostRemediation to emit generate/refresh/warn actions with toolkit/WSL/systemctl branching.
onboard: runtime assertion update
src/lib/onboard.ts
assertCdiNvidiaGpuSpecPresent treats missing OR stale CDI specs as fatal when CDI injection is configured and updates the fatal message text.
Installer-level stale-spec repair and messaging
scripts/install.sh
Adds repair_installer_stale_nvidia_cdi_spec(flagged_file) to enable/start nvidia-cdi-refresh (uses sudo when needed); repair_installer_nvidia_cdi_spec now distinguishes missing vs stale and delegates; CLI messaging updated to reference “missing or stale” / “Refreshing NVIDIA CDI device spec”.
Tests, harness updates, and test reorg
src/lib/onboard/preflight-cdi.test.ts, src/lib/onboard/docker-cdi.test.ts, src/lib/onboard/preflight.test.ts, test/install-preflight.test.ts
Adds dedicated CDI preflight and docker-cdi test suites; removes moved CDI blocks from preflight.test.ts; extends install-preflight harness to simulate stale state, pass CDI_STALE_FILE, and verify refresh-only repair behavior; updates assertions.

Sequence Diagram

sequenceDiagram
  participant CLI as assessHost
  participant DockerCdi as docker-cdi.ts
  participant Systemctl as systemctl
  participant Planner as planHostRemediation
  participant Installer as installer (scripts/install.sh)

  CLI->>DockerCdi: discover effective CDI spec / collect device nodes
  DockerCdi->>CLI: report mismatch / stale
  CLI->>Systemctl: probe nvidia-cdi-refresh (is-enabled/is-failed)
  CLI->>Planner: compute remediation (missing | stale | warn)
  Planner->>Installer: emit repair plan (generate / refresh / warn)
  Installer->>Systemctl: enable/start nvidia-cdi-refresh (sudo if needed)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4298: Modifies CDI remediation/validation decision points and WSL+Docker Desktop exemptions; related to CDI handling and installer repair flow.

Suggested labels

fix, Docker, bug-fix, Platform: Ubuntu

Suggested reviewers

  • cv

"I sniffed the stale CDI trace,
I nudged systemd to start the race,
I checked majors, minors, and specs,
I hopped to enable refresh checks,
Now GPUs wake with sunny grace!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: detecting stale or unrefreshed NVIDIA CDI specs in host preflight, which aligns with the core objective of adding staleness detection to prevent incorrect device node injection.
Linked Issues check ✅ Passed The PR implements the primary coding objective from #4658: detecting and handling stale NVIDIA CDI specs that cause incorrect device node injection and CUDA failures in containers. All major requirements are met.
Out of Scope Changes check ✅ Passed All changes are focused on NVIDIA CDI spec detection, staleness validation, and remediation. No out-of-scope modifications were introduced; changes remain targeted to the CDI preflight assessment and repair workflow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cdi-refresh-preflight

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 1 needs attention, 6 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 4 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • Add targeted runtime validation for the linked DGX Spark vLLM path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit and installer-preflight coverage for CDI detection and repair, but the changed installer/onboarding/sandbox infrastructure does not include targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
    • Recommendation: Add or identify targeted runtime/integration validation for the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
    • Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.

🔎 Worth checking

  • Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and `sudo nvidia-ctk cdi generate --output=...` implement localized recovery/fallback behavior.
  • Source-of-truth review needed: Stale CDI mismatch path transport: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` and the installer node snippet both split at whitespace.
  • Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:274): The displayed remediation command builders shell-quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. CDI spec directories are host filesystem paths, so a path containing spaces is truncated before it reaches remediation output or installer stale repair messaging.
    • Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches both TypeScript remediation output and the installer stale-repair path.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe in `scripts/install.sh` uses `mismatch.trim().split(/\s+/, 1)[0]`. Existing builder tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
  • Avoid lossy host-path parsing before privileged CDI repair guidance (src/lib/onboard/docker-cdi.ts:374): Host-derived CDI paths flow into sudo/manual remediation guidance. The commands are shell-quoted, so this is not a confirmed command-injection issue, but the whitespace-delimited mismatch string loses path integrity before privileged repair guidance is generated.
    • Recommendation: Carry the stale CDI file path as a structured field through the assessment and installer repair probe, then keep shell-quoting at the final command-rendering boundary. Add a negative test with spaces and shell metacharacters in the effective stale spec path.
    • Evidence: `buildStaleCdiWarnCommands()` and `buildStaleCdiManualWarnCommands()` quote their `flaggedFilePath`, but `flaggedFilePath` is produced by `extractCdiMismatchFilePath()` from a whitespace-delimited string. The installer repeats the same split before calling `repair_installer_stale_nvidia_cdi_spec`.
  • Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:391): The PR adds localized recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now mostly clear, but the code still does not state when this host workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
    • Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct `nvidia-ctk cdi generate` fallback guidance. Tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.
  • Offset new CDI helper monolith growth (src/lib/onboard/docker-cdi.ts:1): The PR successfully moves CDI logic out of the already-large preflight file, but the new `docker-cdi.ts` helper grew into a 506-line module in one patch. This creates a new onboarding host-glue hotspot in a security-sensitive area that handles host filesystem parsing and privileged repair guidance.
    • Recommendation: Consider splitting parsing, host assessment, and remediation-command rendering into smaller modules or otherwise offset the growth before this becomes another hard-to-review monolith.
    • Evidence: Deterministic monolith evidence: `src/lib/onboard/docker-cdi.ts` baseLines 57, headLines 506, delta +449, severity blocker rationale: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and `sudo nvidia-ctk cdi generate --output=...` implement localized recovery/fallback behavior.
  • Source-of-truth review needed: Stale CDI mismatch path transport: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` and the installer node snippet both split at whitespace.
  • Add targeted runtime validation for the linked DGX Spark vLLM path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit and installer-preflight coverage for CDI detection and repair, but the changed installer/onboarding/sandbox infrastructure does not include targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
    • Recommendation: Add or identify targeted runtime/integration validation for the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
    • Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.
  • Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:274): The displayed remediation command builders shell-quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. CDI spec directories are host filesystem paths, so a path containing spaces is truncated before it reaches remediation output or installer stale repair messaging.
    • Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches both TypeScript remediation output and the installer stale-repair path.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe in `scripts/install.sh` uses `mismatch.trim().split(/\s+/, 1)[0]`. Existing builder tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
  • Avoid lossy host-path parsing before privileged CDI repair guidance (src/lib/onboard/docker-cdi.ts:374): Host-derived CDI paths flow into sudo/manual remediation guidance. The commands are shell-quoted, so this is not a confirmed command-injection issue, but the whitespace-delimited mismatch string loses path integrity before privileged repair guidance is generated.
    • Recommendation: Carry the stale CDI file path as a structured field through the assessment and installer repair probe, then keep shell-quoting at the final command-rendering boundary. Add a negative test with spaces and shell metacharacters in the effective stale spec path.
    • Evidence: `buildStaleCdiWarnCommands()` and `buildStaleCdiManualWarnCommands()` quote their `flaggedFilePath`, but `flaggedFilePath` is produced by `extractCdiMismatchFilePath()` from a whitespace-delimited string. The installer repeats the same split before calling `repair_installer_stale_nvidia_cdi_spec`.
  • Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:391): The PR adds localized recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now mostly clear, but the code still does not state when this host workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
    • Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct `nvidia-ctk cdi generate` fallback guidance. Tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.
  • Offset new CDI helper monolith growth (src/lib/onboard/docker-cdi.ts:1): The PR successfully moves CDI logic out of the already-large preflight file, but the new `docker-cdi.ts` helper grew into a 506-line module in one patch. This creates a new onboarding host-glue hotspot in a security-sensitive area that handles host filesystem parsing and privileged repair guidance.
    • Recommendation: Consider splitting parsing, host assessment, and remediation-command rendering into smaller modules or otherwise offset the growth before this becomes another hard-to-review monolith.
    • Evidence: Deterministic monolith evidence: `src/lib/onboard/docker-cdi.ts` baseLines 57, headLines 506, delta +449, severity blocker rationale: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, gpu-e2e
Optional E2E: onboard-negative-paths-e2e, gpu-double-onboard-e2e

Dispatch hint: cloud-onboard-e2e,gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium): Covers the installer plus non-interactive onboarding path on a clean Linux Docker environment, which is directly affected by install.sh and preflight/onboard changes even when the host is not a GPU/CDI machine.
  • gpu-e2e (high): Required because the PR changes NVIDIA GPU CDI detection/repair and the onboarding guard for OpenShell GPU startup. This job runs the real GPU/Ollama user flow on a GPU runner and is the closest existing E2E coverage for Docker CDI GPU passthrough into the sandbox.

Optional E2E

  • onboard-negative-paths-e2e (medium): Useful adjacent coverage for non-interactive onboarding validation and friendly failure paths after preflight error handling changed, but it does not specifically synthesize missing/stale NVIDIA CDI specs.
  • gpu-double-onboard-e2e (high): Optional extra confidence that repeated GPU onboarding still works after CDI repair/preflight changes; higher cost and not specifically targeted at stale CDI repair.

New E2E recommendations

  • GPU Docker CDI stale-spec repair (high): Existing E2E jobs exercise a normal GPU environment but do not deliberately create a stale effective nvidia.com/gpu CDI spec or unhealthy nvidia-cdi-refresh units, so they may miss regressions in the new stale-spec repair path and messaging.
    • Suggested test: Add a GPU-runner E2E that snapshots/restores host CDI state, injects a stale /etc/cdi or /var/run/cdi nvidia.yaml device-node mismatch, runs install/onboard preflight, verifies the refresh-service repair path is attempted, and confirms onboarding succeeds or blocks with actionable remediation.
  • installer CDI missing-spec repair (medium): Unit tests cover missing-spec command construction, but no existing E2E validates installer behavior when Docker advertises CDISpecDirs and nvidia.com/gpu is absent on a real host.
    • Suggested test: Add an installer preflight E2E fixture on a GPU/CDI-capable runner that temporarily hides the NVIDIA CDI spec, runs scripts/install.sh --non-interactive, and asserts the installer invokes the refresh/generate path and re-assesses the effective spec before onboarding.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,gpu-e2e

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw
Optional scenario E2E: ubuntu-no-docker-preflight-negative, wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Core installer/onboarding preflight code changed. The baseline Ubuntu repo-install OpenClaw scenario is the smallest standard scenario that exercises the repo-current install path plus onboarding preflight and sandbox startup on a normal Docker host.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gpu-repo-local-ollama-openclaw: The PR changes NVIDIA Docker CDI detection, stale/missing CDI spec remediation, and GPU preflight blocking behavior. This is the only dispatchable scenario routed through the gpu-docker-cdi runtime and therefore the only scenario path that directly exercises the changed CDI/GPU onboarding surface.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • ubuntu-no-docker-preflight-negative: Optional adjacent coverage for preflight failure handling. The main changes are CDI/GPU-specific, but preflight assessment/remediation code was refactored, so the no-Docker negative scenario can catch regressions in early preflight failure classification without running the full sandbox path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-no-docker-preflight-negative
  • wsl-repo-cloud-openclaw: Optional platform-adjacent coverage because the changed CDI preflight paths explicitly exempt WSL Docker Desktop runtime behavior. This uses a special Windows/WSL runner, so it is not primary unless WSL behavior is a release concern.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • scripts/install.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-cdi.ts
  • src/lib/onboard/preflight.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/install.sh`:
- Around line 1903-1912: The stale-spec auto-repair branch currently runs when
host.cdiNvidiaGpuSpecStale && host.cdiNvidiaGpuSpecNeedsRepair &&
!host.cdiNvidiaGpuSpecMissing && !isWslDockerDesktopRuntime(host), which can
prompt sudo even when nvidia-container-toolkit is absent; add a requirement that
host.nvidiaContainerToolkitInstalled is true to that condition (i.e., gate the
stale branch on host.nvidiaContainerToolkitInstalled) so the installer defers to
the later "install toolkit first" remediation instead of attempting auto-repair.

In `@src/lib/onboard.ts`:
- Around line 1844-1849: The guard that checks host.cdiNvidiaGpuSpecNeedsRepair,
host.cdiNvidiaGpuSpecMissing and optedOutGpuPassthrough is verbose and inflates
the file; collapse it into a single compact conditional and keep behavior
identical (early return when not needing repair or when opted out). Then
compress the subsequent error logging/throwing block (the code that runs when
the check fails) by formatting the error message as a single-line template
string or using a short helper to build the message so you remove extra blank
lines and multi-line concatenation; update references to the same symbols
(host.cdiNvidiaGpuSpecNeedsRepair, host.cdiNvidiaGpuSpecMissing,
optedOutGpuPassthrough) and to the function that emits the error so behavior
remains unchanged.

In `@src/lib/onboard/preflight.ts`:
- Around line 1189-1191: The stale-spec branch must mirror the missing-spec
branch by checking assessment.systemctlAvailable before producing systemd-only
commands: change the repairCommands assignment so when missingSpec is false it
uses buildStaleCdiWarnCommands(flaggedFilePath) only if
assessment.systemctlAvailable is true, otherwise call a new or existing
non-systemd fallback (e.g., buildStaleCdiManualWarnCommands or return a
non-blocking warning action) that emits manual remediation steps instead of
systemctl commands; update references to repairCommands,
buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
assessment.systemctlAvailable, specPath and flaggedFilePath accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f820c13a-4139-4ddd-be6a-1414d00efb93

📥 Commits

Reviewing files that changed from the base of the PR and between 11fcbf7 and 366f996.

📒 Files selected for processing (6)
  • scripts/install.sh
  • src/lib/onboard.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.test.ts
  • src/lib/onboard/preflight.ts
  • test/install-preflight.test.ts
💤 Files with no reviewable changes (1)
  • src/lib/onboard/preflight.test.ts

Comment thread scripts/install.sh
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/preflight.ts Outdated
zyang-dev added 2 commits June 3, 2026 16:19
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev added the v0.0.58 Release target label Jun 3, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/preflight-cdi.test.ts (1)

51-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make these assessHost cases fully hermetic.

These tests don't inject runCaptureImpl, so assessHost() still shells out for package-manager detection. That makes the suite depend on whatever tools are installed on the runner instead of just the fixture data. Add a shared no-op/stub runCaptureImpl for these cases so the unit tests stop touching the host environment. As per coding guidelines, **/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs in unit tests.

Also applies to: 71-86, 222-240

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight-cdi.test.ts` around lines 51 - 64, The tests call
assessHost but don't inject runCaptureImpl, so package-manager detection still
shells out; add a hermetic stub by passing runCaptureImpl that returns a
deterministic capture object (e.g., { stdout: "", stderr: "", code: 0 } or
similar) to each assessHost invocation in this test file; update the assessHost
calls (the ones in this test and the other similar blocks) to include
runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 }) so the unit tests no
longer touch the real host or NVIDIA tools.
♻️ Duplicate comments (1)
src/lib/onboard/preflight.ts (1)

894-896: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep stale-spec remediation runnable without systemd.

When systemctlAvailable === false, the stale-spec path still emits only systemctl commands and then marks the action blocking. That leaves non-systemd Linux hosts with no executable repair path, even though the missing-spec path already has a fallback. Gate this branch the same way or downgrade it to a non-blocking/manual warning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight.ts` around lines 894 - 896, The stale-spec branch
currently always generates systemctl-only remediation via
buildStaleCdiWarnCommands and can be blocking on non-systemd hosts; modify the
logic that sets repairCommands so it checks systemctlAvailable the same way the
missingSpec path does (or make the stale-spec path non-blocking/manual when
systemctlAvailable === false). Specifically, when choosing between
buildNvidiaCdiRepairCommands and buildStaleCdiWarnCommands, gate the
buildStaleCdiWarnCommands call on systemctlAvailable (or replace it with a
non-systemd fallback/no-op warning and clear the blocking flag) using the
existing symbols repairCommands, systemctlAvailable, buildStaleCdiWarnCommands,
buildNvidiaCdiRepairCommands, missingSpec, and flaggedFilePath to locate and
implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 257-272: The shell commands generated by
buildNvidiaCdiRepairCommands interpolate specDir and specPath raw into strings;
run specDir and specPath through the repo's shell-quoting helper before building
each command string so spaces or metacharacters are safely quoted, and do the
same for flaggedFilePath in the other remediation block referenced around lines
300-305—i.e., locate where flaggedFilePath is interpolated and replace direct
interpolation with the quoted value from the shell-quoting helper.

---

Outside diff comments:
In `@src/lib/onboard/preflight-cdi.test.ts`:
- Around line 51-64: The tests call assessHost but don't inject runCaptureImpl,
so package-manager detection still shells out; add a hermetic stub by passing
runCaptureImpl that returns a deterministic capture object (e.g., { stdout: "",
stderr: "", code: 0 } or similar) to each assessHost invocation in this test
file; update the assessHost calls (the ones in this test and the other similar
blocks) to include runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 })
so the unit tests no longer touch the real host or NVIDIA tools.

---

Duplicate comments:
In `@src/lib/onboard/preflight.ts`:
- Around line 894-896: The stale-spec branch currently always generates
systemctl-only remediation via buildStaleCdiWarnCommands and can be blocking on
non-systemd hosts; modify the logic that sets repairCommands so it checks
systemctlAvailable the same way the missingSpec path does (or make the
stale-spec path non-blocking/manual when systemctlAvailable === false).
Specifically, when choosing between buildNvidiaCdiRepairCommands and
buildStaleCdiWarnCommands, gate the buildStaleCdiWarnCommands call on
systemctlAvailable (or replace it with a non-systemd fallback/no-op warning and
clear the blocking flag) using the existing symbols repairCommands,
systemctlAvailable, buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
missingSpec, and flaggedFilePath to locate and implement the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a42d9078-1396-486d-a757-485dcbccdbbf

📥 Commits

Reviewing files that changed from the base of the PR and between 9f74eef and 34c7b5f.

📒 Files selected for processing (4)
  • src/lib/onboard/docker-cdi.test.ts
  • src/lib/onboard/docker-cdi.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard/docker-cdi.test.ts

Comment thread src/lib/onboard/docker-cdi.ts
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/docker-cdi.ts (1)

337-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align user-facing explanation with remediation command (nemoclaw onboard) to avoid confusing OpenShell command references.
src/lib/onboard/docker-cdi.ts remediation tells users to rerun nemoclaw onboard (optionally --no-gpu), but the explanation strings attribute the failure to OpenShell’s gateway start --gpu, which is only an internal detail of the GPU passthrough path. Rewording to reference nemoclaw onboard GPU passthrough keeps the cause while matching the command users are actually told to run.

📝 Proposed wording alignment
-    "OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a stale " +
+    "`nemoclaw onboard` (GPU passthrough) injects devices from the CDI spec, so a stale " +
   reasons.push(
-    "OpenShell's `gateway start --gpu` can fail until the CDI spec is refreshed and verified.",
+    "`nemoclaw onboard` GPU passthrough can fail until the CDI spec is refreshed and verified.",
   );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-cdi.ts` around lines 337 - 379, Replace user-facing
references to "OpenShell's `gateway start --gpu`" with the user-facing
remediation command "nemoclaw onboard" (mentioning the optional --gpu flag where
appropriate) in the strings in this file: update the long explanatory return
string near the top of docker-cdi.ts and the final reasons.push in
explainNvidiaCdiRepairReason so they reference running "nemoclaw onboard" (or
"nemoclaw onboard --no-gpu"/"--gpu" as applicable) instead of the internal
OpenShell command; keep the rest of each message unchanged so the technical
detail and mitigation advice remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 337-379: Replace user-facing references to "OpenShell's `gateway
start --gpu`" with the user-facing remediation command "nemoclaw onboard"
(mentioning the optional --gpu flag where appropriate) in the strings in this
file: update the long explanatory return string near the top of docker-cdi.ts
and the final reasons.push in explainNvidiaCdiRepairReason so they reference
running "nemoclaw onboard" (or "nemoclaw onboard --no-gpu"/"--gpu" as
applicable) instead of the internal OpenShell command; keep the rest of each
message unchanged so the technical detail and mitigation advice remain intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05a9f650-d750-47e3-93a9-56686c4d79b3

📥 Commits

Reviewing files that changed from the base of the PR and between ead83f7 and 529f1de.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-cdi.test.ts
  • src/lib/onboard/docker-cdi.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-cdi.test.ts

@prekshivyas prekshivyas self-assigned this Jun 3, 2026
@cv cv added v0.0.59 Release target and removed v0.0.58 Release target labels Jun 4, 2026
zyang-dev added 2 commits June 3, 2026 19:44
…mediation command paths

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving. This is a careful fix for a nasty silent failure (stale CDI → nvidia-smi works but cuInit fails). Verified the core logic:

  • Precedence: findEffectiveNvidiaCdiSpec iterates [...specDirs].reverse() and returns the first match, so the highest-precedence dir wins — Docker's CDISpecDirs is [/etc/cdi, /var/run/cdi], so a fresh /var/run/cdi/nvidia.yaml correctly overrides a stale /etc/cdi leftover. Matches the nvidia-cdi-refresh model.
  • Fail-safe comparison: findCdiDeviceNodeMismatch continues when the live device can't be stat'd and only flags when a /dev/* node genuinely exists with different major/minor (stat -c %t %T parsed as hex). Omitted minor defaults to 0, hostPath preferred. So it won't false-positive on absent/non-GPU nodes.
  • Gating: detection only applies on linux + NVIDIA GPU + docker-reported CDI dirs, and excludes WSL+docker-desktop. Missing/proven-stale → blocking: true (right call — better to halt than onboard a broken GPU sandbox); refresh-unit health alone is a non-blocking warning.
  • Shell safety: remediation commands quote dynamic paths via shellQuote (quotedSpecDir/quotedSpecPath/quotedFlaggedFilePath).
  • install.sh's stale auto-repair branch is gated on nvidiaContainerToolkitInstalled in the assessment emitter, so it won't fire nvidia-ctk/refresh on a toolkit-less host.

Tests cover precedence, stale/matching/absent devices, omitted-minor, hostPath, refresh warnings, and stale auto-fix; CI is green.

Re CodeRabbit: its onboard.ts growth-guardrail "Major" is stale (codebase-growth-guardrails passes now), and the quote-paths finding is already handled by the shellQuote calls above — both can be disregarded. One courtesy double-check, non-blocking: confirm the systemctlAvailable === false stale path (CodeRabbit @ preflight.ts:1191) renders the manual-removal guidance you intend — the code does branch to buildStaleCdiManualWarnCommands, so it looks right; just worth an eyeball. I did not run the suite locally or exhaustively trace the manual path.

@cv cv merged commit 8ecaf0a into main Jun 4, 2026
28 checks passed
@cv cv deleted the fix/cdi-refresh-preflight branch June 4, 2026 17:46
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.

## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.

Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Updated default model for local Ollama inference setup to qwen3.5:9b
  * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression v0.0.59 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error"

4 participants