fix(onboard): harden GPU patch reconnect and rc cleanup against non-fatal failures#4777
Conversation
…atal failures Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Warning Review limit reached
More reviews will be available in 58 minutes and 43 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds supervisor-reconnect reconciliation and finalize/rollback for Docker GPU patches, bumps reconnect debounce default to 15 polls, exports a dockerStart helper, and extracts rc-file cleanup into a standalone Python helper included in the runtime image with tests and startup wiring updates. ChangesDocker GPU patch rollback and reconnect tuning
RC cleanup packaging and ownership guard
Sequence DiagramsequenceDiagram
participant Recreate as recreateOpenShellDockerSandboxWithGpu
participant DockerAdapter as Docker adapter
participant Supervisor as waitForOpenShellSupervisorReconnect
participant Finalize as finalizeDockerGpuPatchBackup
Recreate->>DockerAdapter: create GPU-enabled container (rename/backup → new)
DockerAdapter-->>Recreate: new container created
Recreate->>Supervisor: waitForOpenShellSupervisorReconnect (debounced)
alt supervisor reconnects
Supervisor-->>Recreate: execReady
Recreate->>DockerAdapter: docker rm backup container
DockerAdapter-->>Recreate: rm result
Recreate-->>Recreate: return success (backupRemoved=true)
else reconnect fails
Supervisor-->>Recreate: timeout/error
Recreate->>Finalize: finalizeDockerGpuPatchBackup({result, supervisorReady:false})
Finalize->>DockerAdapter: stop/remove new container
Finalize->>DockerAdapter: rename backup -> original
Finalize->>DockerAdapter: docker start original (dockerStart)
DockerAdapter-->>Finalize: start result
Finalize-->>Recreate: {backupRemoved:false, rolledBack}
Recreate-->>Recreate: record rolledBack in failure context and throw error
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 2 worth checking, 1 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 896-903: The rollback currently treats dockerStart as
fire-and-forget and always returns true; update the rollback to capture the
result of d.dockerStart(refs.originalName, containerOpts), check it with
isZeroStatus (like the restored variable), and return false if dockerStart fails
(non-zero), otherwise return true; reference d.dockerRename/restored,
isZeroStatus, d.dockerStart, refs.backupContainerName and refs.originalName so
the change is applied to the same rollback flow and ensure any caller sees the
actual start failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8c35031f-d881-49ad-b95c-b752b4a841f6
📒 Files selected for processing (7)
scripts/nemoclaw-start.shsrc/lib/adapters/docker/container.tssrc/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.tssrc/lib/onboard/docker-gpu-supervisor-reconnect.test.tssrc/lib/onboard/docker-gpu-supervisor-reconnect.tstest/nemoclaw-start.test.ts
…ainer Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26972019075
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-gpu-sandbox-create.ts`:
- Around line 157-161: Update the deferred finalize error message to explicitly
indicate rollback failure when finalizeOutcome.rolledBack is false: locate the
new Error(...) construction that uses finalizeOutcome.rolledBack and replace the
generic "did not reconnect" message in the rolledBack === false branch with a
clear message such as "OpenShell supervisor did not reconnect to the GPU-enabled
container and rollback failed; pre-patch sandbox was NOT restored" (or
equivalent wording) so operators can distinguish reconnect-only failures from
failed rollbacks; keep the rolledBack === true message unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 036a97c9-da7c-4dd1-a309-28633f77f5a0
📒 Files selected for processing (3)
src/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.tssrc/lib/onboard/docker-gpu-sandbox-create.ts
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4777.docs.buildwithfern.com/nemoclaw |
Selective E2E Results — ✅ All requested jobs passedRun: 26991352191
|
|
@laitingsheng can you address the findings and suggestions in #4777 (comment), please? Otherwise, looking great! |
…ocused module Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…oot rc ownership mismatch is fixture-tested Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27003399558
|
… it against CodeQL findings Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27005449457
|
…invariant, trim docker-gpu-patch test hotspot Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27006820492
|
…inst advisor findings Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…nd drop issue refs from comments Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…d context Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27013516426
|
Selective E2E Results — ✅ All requested jobs passedRun: 27013613190
|
prekshivyas
left a comment
There was a problem hiding this comment.
Re-reviewed against the current head (de04db9) — refreshing the earlier approval, which predated a substantial rework. Verified the substantive logic:
- Rollback path (
docker-gpu-patch-finalize.ts):reconcileSupervisorReconnectruns the reconnect probe before removing the backup, so a failed reconnect rolls back instead of leaving the deleted-backup/failed-new state from #4664.rollbackToBackupContainerordering is correct — stop+rm the new container (freeing the original name), rename backup→original (bail withfalseif that fails), start it — and the thrown error honestly reports whether the pre-patch sandbox was restored. On the healthy path, backup removal is best-effort with the realrmstatus surfaced for diagnostics. - rc-shim cleanup (
clean_runtime_shell_env_shim.py): the ownership guardif uid != 0 && st.st_uid != uid → skip(exit 0)is correct — it only leaves the legacy stanza when it genuinely cannot legally rewrite (not root, not owner), fixing the #4713 errexit crash. The script is otherwise hardened:O_NOFOLLOW(symlink refusal), regular-file check, post-clean verification that the shim is actually gone (exit 1 if it survives a rewritable case), TOCTOU-safe/proc/self/fdreopen withsame_filere-validation, and original-mode preservation. The threat-model note (leftover shim only sources the mode-444 proxy-env file) holds. - Shipping: the extracted script is staged via
build-context.ts, COPY'd to/usr/local/lib/nemoclaw/in the Dockerfile, and invoked from there bynemoclaw-start.sh(with a source-relative fallback for tests) — no missing-file risk.
The debounce bump (5→15 polls) is a straightforward tolerance increase. Tests cover the rollback success/restore-failure paths, the debounce default, and both rc-cleanup branches. CI green. Good to merge.
## Summary - Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877. - Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting. - Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`. ## Source summary - #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat. - #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification. - #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect. - #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior. - #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback. - #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands. - #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior. - #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`. - #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior. - #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port. - #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior. - #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording. - #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills. - #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage. - #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption. - #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage. - #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion. - #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception. - #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback. - #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting. - #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage. - #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds. - #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches. - #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation. - #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files. - #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens. - #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory. - #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run` - `npm run docs` - `git diff --check` - skip-term scan across `docs/`, `.agents/skills/`, and `skills/` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * DeepSeek-V4-Flash now available as default inference model for DGX Station. * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection. * Added weather and public-reference policy presets for expanded agent capabilities. * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts. * **Bug Fixes** * Improved policy tier validation to prevent invalid configurations. * Better sandbox cleanup scoping by port to prevent conflicts across deployments. * Added GPU patch failure recovery with automatic rollback. * **Documentation** * Expanded troubleshooting guides for inference, security, and sandbox lifecycle. * Added .dockerignore best practices for custom deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Two related hardening changes on the Docker GPU patch path.
The supervisor reconnect wait now gets a larger default debounce window. The patch also rolls back to the pre-patch container when reconnect fails. A transient or environment-specific reconnect issue no longer leaves the user with a deleted-backup, failed-new state.
The non-root rc shim cleanup at container start now exits cleanly when it cannot legally rewrite a root-owned rc file. Previously it crashed the container under errexit.
Related Issue
Fixes #4664
Fixes #4713
Changes
Raised
DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_PHASE_DEFAULT_DEBOUNCE_POLLSfrom 5 to 15. That is ~30 seconds of sustained Error tolerated instead of ~10 seconds. The original WSL2 repro for [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 showed about 12 seconds ofsandbox listcache divergence after recreate, sitting at the edge of the old window on slower hosts.Reordered
recreateOpenShellDockerSandboxWithGpuso the supervisor reconnect wait runs before the backup container is removed. If the wait fails, a newrollbackToBackupContainerhelper takes over. It stops and removes the new GPU container, renames the backup back to the original name, then starts it. The thrown error message reports whether the rollback succeeded.Added a
dockerStartadapter alongside the existingdockerStopand friends. Threaded it throughDockerGpuPatchDepsanddepsWithDefaultsso the rollback path can issue the start.Extended
DockerGpuPatchFailureContextwith arolledBackflag. Surfaced it as arolled_back=line in the on-disk diag summary.Added an ownership guard to the
ensure_runtime_shell_env_shimPython heredoc inscripts/nemoclaw-start.sh. When the rc file is not owned by the current uid and we are not root, the cleanup now logs a[SECURITY]marker and exits 0. Previously it triedfchmodon a file it could not chmod, raised EPERM, and took the entire container down through errexit. This matches the failure path observed in sandbox container is fallen with exit code 1 when deploying NemoClaw with GPU #4713.Tests cover the new debounce default, the rollback success path, the rollback restore-failure path, and both the ownership-skip and owner-rewrite branches of the rc cleanup.
Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
Documentation