fix(onboard): prefer CDI GPU mode over --gpus on CDI hosts#4956
fix(onboard): prefer CDI GPU mode over --gpus on CDI hosts#4956jason-ma-nv wants to merge 2 commits into
Conversation
On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g. /etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), `nemoclaw onboard` selected the `--gpus all` mode for the GPU patch recreate. `docker create --gpus all` is accepted on these hosts so the create-only probe passed, but OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a container recreated via the legacy --gpus path diverges from how the supervisor expects the GPU container to be wired up. The supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1. Reorder GPU mode candidates so the CDI mode (`--device nvidia.com/gpu=all`) is preferred ahead of --gpus whenever a CDI spec is detected; --gpus and the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (candidate order unchanged). Note: the supervisor-reconnect failure is a runtime symptom on real GPU hardware; final validation requires the GPU E2E path (see "verify" below). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughReorders GPU patch mode probing to prefer CDI when Docker reports a readable NVIDIA CDI spec, reorganizes related re-exports, and adds tests validating CDI-first selection, fallback to --gpus and nvidia-runtime, and propagation of the CDI device flag into recreate flows. ChangesCDI-first GPU patch mode candidate selection
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
PR Review AdvisorFindings: 1 needs attention, 3 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…4948) Address PR review-advisor findings on the CDI-first GPU mode change: - Move the CDI mode-selection tests out of the docker-gpu-patch.test.ts monolith into a focused docker-gpu-patch-mode-selection.test.ts spec, offsetting the flagged monolith growth (back to ~baseline line count). - Pin the fallback chain on a CDI host: CDI probe fails -> --gpus selected; CDI and --gpus probes fail -> NVIDIA runtime selected (attempt order starts with cdi in both cases). - Add a recreate-boundary assertion: recreateOpenShellDockerSandboxWithGpu passes --device nvidia.com/gpu=all to dockerRunDetached on a CDI host and never emits --gpus, proving the selected CDI mode reaches the real recreate command (the patched_create_option the issue logs). All four new tests fail under the previous --gpus-first ordering, confirming they pin the fix rather than restating it. No production code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g.
/etc/cdi/nvidia.yamlon Ubuntu 24.04/26.04),nemoclaw onboardselected--gpus allfor the GPU patch recreate, the OpenShell supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1. This reorders GPU mode selection to prefer the CDI mode (--device nvidia.com/gpu=all) ahead of--gpuswhenever a CDI spec is detected, matching how OpenShell'sgateway start --gpuinjects GPU devices.Related Issue
Fixes #4948
Changes
src/lib/onboard/docker-gpu-patch.ts:buildDockerGpuModeCandidatesnow puts thecdicandidate first whencdiAvailableis true;--gpusand the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (order unchanged:gpus,nvidia-runtime). Jetson path unchanged.src/lib/onboard/docker-gpu-patch.test.ts: adds a repro test asserting that on a CDI host where the--gpusprobe would pass, thecdimode is selected; updates the existing candidate-order assertion and stale comments to the corrected ordering.Why this is the right layer
The create-only probe (
docker create --gpus all) is accepted on these hosts, so--gpus alllooked viable but diverges at runtime from OpenShell's CDI-based injection — the supervisor then never reconnects to the recreated container. Preferring CDI when a spec is present removes that divergence.Validation caveat
The supervisor-reconnect failure is a runtime symptom that only manifests on real GPU + Docker-CDI hardware. Unit tests pin the deterministic mode-selection decision; final confirmation requires the GPU E2E path (
e2e-branch-validation:gpu) on an affected host. The existingNEMOCLAW_DOCKER_GPU_PATCH=0escape hatch is unchanged.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Verification notes: the full onboard suite passes (
npx vitest run src/lib/onboard/— 1189 tests, including the new #4948 repro),npm run typecheck:clipasses, and Biome is clean on the changed files. The fullnpm testrun has 16 pre-existing, environment-dependent failures (network/port-binding e2e-framework fixtures, aMODULE_NOT_FOUNDinssrf-parity, and missing Docker/OpenShell fixtures infetch-guard-patch-regression) that are unrelated to this change and do not reference the modified module — hencenpm testis left unchecked.Signed-off-by: Jason Ma jama@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests