Skip to content

fix(onboard): harden GPU patch reconnect and rc cleanup against non-fatal failures#4777

Merged
cv merged 13 commits into
mainfrom
fix/4664-4713-gpu-patch-hardening
Jun 5, 2026
Merged

fix(onboard): harden GPU patch reconnect and rc cleanup against non-fatal failures#4777
cv merged 13 commits into
mainfrom
fix/4664-4713-gpu-patch-hardening

Conversation

@laitingsheng

@laitingsheng laitingsheng commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Two related hardening changes on the Docker GPU patch path.

The supervisor reconnect wait now gets a larger default debounce window. The patch also rolls back to the pre-patch container when reconnect fails. A transient or environment-specific reconnect issue no longer leaves the user with a deleted-backup, failed-new state.

The non-root rc shim cleanup at container start now exits cleanly when it cannot legally rewrite a root-owned rc file. Previously it crashed the container under errexit.

Related Issue

Fixes #4664
Fixes #4713

Changes

  • Raised DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_PHASE_DEFAULT_DEBOUNCE_POLLS from 5 to 15. That is ~30 seconds of sustained Error tolerated instead of ~10 seconds. The original WSL2 repro for [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 showed about 12 seconds of sandbox list cache divergence after recreate, sitting at the edge of the old window on slower hosts.

  • Reordered recreateOpenShellDockerSandboxWithGpu so the supervisor reconnect wait runs before the backup container is removed. If the wait fails, a new rollbackToBackupContainer helper takes over. It stops and removes the new GPU container, renames the backup back to the original name, then starts it. The thrown error message reports whether the rollback succeeded.

  • Added a dockerStart adapter alongside the existing dockerStop and friends. Threaded it through DockerGpuPatchDeps and depsWithDefaults so the rollback path can issue the start.

  • Extended DockerGpuPatchFailureContext with a rolledBack flag. Surfaced it as a rolled_back= line in the on-disk diag summary.

  • Added an ownership guard to the ensure_runtime_shell_env_shim Python heredoc in scripts/nemoclaw-start.sh. When the rc file is not owned by the current uid and we are not root, the cleanup now logs a [SECURITY] marker and exits 0. Previously it tried fchmod on a file it could not chmod, raised EPERM, and took the entire container down through errexit. This matches the failure path observed in sandbox container is fallen with exit code 1 when deploying NemoClaw with GPU #4713.

  • Tests cover the new debounce default, the rollback success path, the rollback restore-failure path, and both the ownership-skip and owner-rewrite branches of the rc cleanup.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • GPU patch rollback now preserves backups and reliably restores them if supervisor reconnect fails; backups removed only after successful reconnect (or reported when not).
  • Improvements

    • Increased supervisor-reconnect debounce window (default ~30s) for more stable GPU patching.
    • Entrypoint cleanup now delegates to an installed runtime helper (with configurable fallback) and ships that helper in the runtime image.
    • Added a helper to start Docker containers programmatically.
  • Tests

    • New and expanded tests covering rc cleanup, finalize/rollback and composed GPU patch flows.
  • Documentation

    • Updated troubleshooting guidance to reflect new debounce defaults.

…atal failures

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@laitingsheng, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 58 minutes and 43 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9098dfb5-8f59-4ec5-a02b-3fef337e5615

📥 Commits

Reviewing files that changed from the base of the PR and between 756d64c and de04db9.

📒 Files selected for processing (2)
  • src/lib/sandbox/build-context.ts
  • test/sandbox-build-context.test.ts
📝 Walkthrough

Walkthrough

Adds supervisor-reconnect reconciliation and finalize/rollback for Docker GPU patches, bumps reconnect debounce default to 15 polls, exports a dockerStart helper, and extracts rc-file cleanup into a standalone Python helper included in the runtime image with tests and startup wiring updates.

Changes

Docker GPU patch rollback and reconnect tuning

Layer / File(s) Summary
Docker start helper
src/lib/adapters/docker/container.ts
dockerStart(containerName, opts) exported as a wrapper around dockerRun(["start", ...]) used by finalize/rollback logic.
Supervisor reconnect debounce tuning
src/lib/onboard/docker-gpu-supervisor-reconnect.ts, src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts, docs/reference/troubleshooting.mdx, .agents/..., skills/...
Default consecutive Error-phase debounce polls increased from 5→15; tests and troubleshooting docs updated to reflect new default and example override guidance.
Finalize / rollback API and impl
src/lib/onboard/docker-gpu-patch-finalize.ts, src/lib/onboard/docker-gpu-patch.ts
Adds rollbackToBackupContainer, finalizeDockerGpuPatchBackup, and reconcileSupervisorReconnect with Docker-dep resolution and zero-status handling; updates types and default deps.
Sandbox create: finalize wiring & overrides
src/lib/onboard/docker-gpu-sandbox-create.ts, src/lib/onboard/docker-gpu-patch.ts
Wires finalize into recreate/create failure path, adds overrideable test seams, extends patch result/context (originalName, backupRemoved, rolledBack), and persists rolled_back to diagnostics.
GPU patch tests and ordering assertions
src/lib/onboard/docker-gpu-patch.test.ts, src/lib/onboard/docker-gpu-patch-finalize.test.ts, src/lib/onboard/docker-gpu-patch-rollback.test.ts, src/lib/onboard/docker-gpu-sandbox-create.test.ts
Adds/updates tests covering backup removal ordering, deferred cleanup, reconnect-failure rollback, restore/start failure cases, finalize outcomes, and composed create flow behavior.

RC cleanup packaging and ownership guard

Layer / File(s) Summary
Packaged rc cleanup script and image wiring
scripts/lib/clean_runtime_shell_env_shim.py, scripts/nemoclaw-start.sh, Dockerfile
Extracts inline heredoc cleanup to clean_runtime_shell_env_shim.py, resolves helper by precedence (/usr/local/lib/nemoclaw/...NEMOCLAW_RC_CLEAN_SCRIPTscripts/lib/...), calls it with (rc_file, shim_text, uid), includes script in image, and updates permissions. Script performs secure open, ownership checks, in-place rewrite or atomic replace, and strict error reporting.
RC cleanup tests and startup wrapper updates
test/clean-runtime-shell-env-shim.test.ts, test/nemoclaw-start.test.ts, test/service-env.test.ts, test/sandbox-provisioning.test.ts
Adds Vitest coverage for owned-file cleanup, ownership-skip diagnostics, symlink refusal, idempotence, and partial-stanza removal; updates startup-snippet extraction and test wrapper headers; adds composed regression test for uid-mismatch verifying proxy-env.sh mode 444 and unchanged .bashrc.

Sequence Diagram

sequenceDiagram
  participant Recreate as recreateOpenShellDockerSandboxWithGpu
  participant DockerAdapter as Docker adapter
  participant Supervisor as waitForOpenShellSupervisorReconnect
  participant Finalize as finalizeDockerGpuPatchBackup

  Recreate->>DockerAdapter: create GPU-enabled container (rename/backup → new)
  DockerAdapter-->>Recreate: new container created
  Recreate->>Supervisor: waitForOpenShellSupervisorReconnect (debounced)
  alt supervisor reconnects
    Supervisor-->>Recreate: execReady
    Recreate->>DockerAdapter: docker rm backup container
    DockerAdapter-->>Recreate: rm result
    Recreate-->>Recreate: return success (backupRemoved=true)
  else reconnect fails
    Supervisor-->>Recreate: timeout/error
    Recreate->>Finalize: finalizeDockerGpuPatchBackup({result, supervisorReady:false})
    Finalize->>DockerAdapter: stop/remove new container
    Finalize->>DockerAdapter: rename backup -> original
    Finalize->>DockerAdapter: docker start original (dockerStart)
    DockerAdapter-->>Finalize: start result
    Finalize-->>Recreate: {backupRemoved:false, rolledBack}
    Recreate-->>Recreate: record rolledBack in failure context and throw error
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

fix, Sandbox, Docker, area: onboarding, platform: container

Suggested reviewers

  • prekshivyas
  • cv

"🐰 A rollback hop when reconnects stall,
The backup waits — it steadies all.
Fifteen beats the watchful drum,
Start the old, or let the new one hum.
RC guards keep files safe for all."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix(onboard): harden GPU patch reconnect and rc cleanup against non-fatal failures' accurately and concisely describes the main changes: increasing reconnect tolerance, adding rollback support, and hardening rc cleanup against non-fatal failures.
Linked Issues check ✅ Passed The PR comprehensively addresses both linked issues #4664 and #4713: increases debounce polls for WSL2 timing race (#4664), implements rollback path when supervisor reconnect fails, adds dockerStart adapter, hardens rc cleanup with ownership guard against non-fatal failures (#4713), and includes test coverage for all hardened paths.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing GPU patch reconnect hardening and rc cleanup ownership issues. Modifications include debounce constant updates, rollback logic, supervisorReconnect reconciliation, rc cleanup safety, Docker adapter additions, and corresponding test coverage with no unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4664-4713-gpu-patch-hardening

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: test-non-root-sandbox-smoke, test-e2e-sandbox, cloud-e2e, openclaw-onboard-security-posture-e2e, gpu-e2e
Optional E2E: gpu-double-onboard-e2e, sandbox-operations-e2e, runtime-overrides-e2e, test-e2e-gateway-isolation

Dispatch hint: cloud-e2e,openclaw-onboard-security-posture-e2e,gpu-e2e

Auto-dispatched E2E: cloud-e2e, openclaw-onboard-security-posture-e2e, gpu-e2e via nightly-e2e.yaml at de04db9e067397c0d30b2e1ddfd160cd2ada3fa5nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • test-non-root-sandbox-smoke (low): Validates the production sandbox entrypoint under --security-opt no-new-privileges. This is the most direct existing E2E guard for the new rc cleanup helper and non-root startup path in nemoclaw-start.sh.
  • test-e2e-sandbox (medium): Builds the changed Dockerfile into a sandbox image and runs the in-container sandbox/blueprint smoke suite, catching Dockerfile copy/chmod/build-context regressions before merge.
  • cloud-e2e (high): Runs the full install → onboard → sandbox verify → live inference user journey. Required because the PR changes sandbox startup, provisioning/build context, and Docker-backed onboarding behavior that can break non-GPU users too.
  • openclaw-onboard-security-posture-e2e (high): Full OpenClaw onboarding with non-root host/security-posture assertions is required for the rc-file cleanup/security-boundary change and proxy-env trust invariant.
  • gpu-e2e (high): Exercises real local Ollama GPU onboarding on a GPU runner. Required because the PR changes the Docker GPU patch/create path, supervisor reconnect wait, backup removal, and rollback-adjacent lifecycle behavior.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent coverage for repeated Ollama/GPU onboarding and token/proxy consistency after the GPU patch changes, but the core success path is covered by gpu-e2e.
  • sandbox-operations-e2e (high): Adds confidence for general sandbox lifecycle operations, multi-sandbox behavior, and gateway recovery after changes to sandbox provisioning/build context and container lifecycle helpers.
  • runtime-overrides-e2e (medium): Adjacent confidence for runtime environment propagation after moving legacy shell-env cleanup into an installed Python helper.
  • test-e2e-gateway-isolation (medium): Optional image-level security boundary check because Dockerfile and startup environment wiring changed, though the PR does not directly alter network isolation policy.

New E2E recommendations

  • Docker GPU rollback on supervisor reconnect failure (high): Existing GPU E2E validates successful GPU onboarding, while the new rollback/finalize behavior is primarily unit-tested. Add an E2E that forces or simulates post-recreate supervisor reconnect failure and asserts the backup container is restored, the backup is not prematurely removed, and diagnostics report rolled_back correctly.
    • Suggested test: docker-gpu-supervisor-reconnect-rollback-e2e
  • legacy rc shim cleanup with mismatched ownership (medium): Unit tests cover the helper, and non-root smoke covers startup, but there is no full image E2E fixture that starts with a legacy /tmp/nemoclaw-proxy-env.sh rc stanza in a mismatched-owner rc file and verifies startup remains safe and proxy env file permissions stay hardened.
    • Suggested test: legacy-runtime-shell-env-shim-mismatched-owner-e2e

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: cloud-e2e,openclaw-onboard-security-posture-e2e,gpu-e2e

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Primary Ubuntu repo-checkout Docker scenario exercises Dockerfile/build-context changes plus sandbox startup via scripts/nemoclaw-start.sh and the new rc-cleanup helper during normal onboarding and smoke validation.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gpu-repo-local-ollama-openclaw: Only routed scenario that exercises the NVIDIA GPU Docker runtime path affected by docker-gpu-patch, supervisor reconnect, deferred backup finalization, rollback, and dockerStart adapter changes.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • wsl-repo-cloud-openclaw: Optional adjacent platform coverage for sandbox startup and Docker Desktop/WSL behavior; useful because the changed startup and GPU reconnect documentation mention slow WSL2 + Docker Desktop setups, but the routed WSL scenario does not exercise the GPU patch path directly.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • Dockerfile
  • scripts/lib/clean_runtime_shell_env_shim.py
  • scripts/nemoclaw-start.sh
  • src/lib/adapters/docker/container.ts
  • src/lib/onboard/docker-gpu-patch-finalize.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-sandbox-create.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts
  • src/lib/sandbox/build-context.ts

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 1 nice ideas
Since last review: 1 prior item resolved, 3 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Runtime validation is still important for the changed sandbox lifecycle paths: The PR adds useful unit and composed tests, including rc cleanup fixture tests and mocked Docker rollback/finalize tests. However, the changed behavior depends on real Docker container state, OpenShell supervisor reconnect timing, rollback side effects, production image helper precedence, root/non-root file ownership, and /tmp trust-boundary permissions. Those interactions are security-sensitive and not fully proven by mocks or extracted shell snippets.
    • Recommendation: Add or identify targeted runtime validation for the behavior-specific cases listed in testDepth.suggestedTests, focused on actual Docker/OpenShell and ownership behavior rather than external job status.
    • Evidence: Changed surfaces include scripts/nemoclaw-start.sh, scripts/lib/clean_runtime_shell_env_shim.py, Dockerfile, src/lib/onboard/docker-gpu-patch-finalize.ts, src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-sandbox-create.ts, and src/lib/onboard/docker-gpu-supervisor-reconnect.ts. Tests cover helper and rollback contracts mostly through fixtures/mocks.
  • [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 Health=starting retry clause is addressed by an alternate design (src/lib/onboard/docker-gpu-supervisor-reconnect.ts:39): One [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 suggested fix says to retry reconnect when the patched Docker container is `Status=running` plus `Health=starting`. This PR intentionally implements a generic Error-phase debounce plus rollback path instead; that may be acceptable, but the reconnect wait still does not inspect or test the exact Docker State/Health condition from the issue.

🌱 Nice ideas

  • Deferred create success path still drops backup cleanup failure status (src/lib/onboard/docker-gpu-sandbox-create.ts:184): The helper now returns `backupRemoved=false` when backup removal fails after a successful supervisor reconnect, but the deferred create caller returns immediately on `supervisorReady` without surfacing or logging that incomplete cleanup state. The sandbox is healthy, so this is not fatal, but it can leave a leaked backup container without a clear diagnostic from this path.
    • Recommendation: Consider logging or otherwise preserving `finalizeOutcome.backupRemoved=false` on the `supervisorReady=true` path, and add a behavior test where the deferred create flow succeeds but backup `docker rm` fails.
    • Evidence: finalizeDockerGpuPatchBackup returns `{ backupRemoved: false, rolledBack: false }` when docker rm fails, while waitForSupervisorReconnectIfNeeded calls finalizeBackup and then immediately returns when `supervisorReady` is true.
Consider writing more tests for
  • **Runtime validation** — Runtime/image: installed `/usr/local/lib/nemoclaw/clean_runtime_shell_env_shim.py` exists while `NEMOCLAW_RC_CLEAN_SCRIPT` points to a failing executable, and `ensure_runtime_shell_env_shim` still uses the installed helper.. The PR adds meaningful unit and composed tests, but the changed runtime/sandbox/infrastructure behavior still depends on real Docker daemon semantics, OpenShell supervisor timing, container rollback side effects, production image staging, root/non-root file ownership, and /tmp trust-boundary permissions.
  • **Runtime validation** — Runtime/image: non-root `nemoclaw-start` with a root-owned `.bashrc` containing the legacy shim exits 0 and leaves `/tmp/nemoclaw-proxy-env.sh` non-user-writable or mode 444 where privilege separation applies.. The PR adds meaningful unit and composed tests, but the changed runtime/sandbox/infrastructure behavior still depends on real Docker daemon semantics, OpenShell supervisor timing, container rollback side effects, production image staging, root/non-root file ownership, and /tmp trust-boundary permissions.
  • **Runtime validation** — Real Docker/OpenShell: after a replacement GPU container starts but supervisor reconnect fails, the backup container is restored under the original name and started.. The PR adds meaningful unit and composed tests, but the changed runtime/sandbox/infrastructure behavior still depends on real Docker daemon semantics, OpenShell supervisor timing, container rollback side effects, production image staging, root/non-root file ownership, and /tmp trust-boundary permissions.
  • **Runtime validation** — Real Docker/OpenShell: supervisor reconnect succeeds but backup `docker rm` fails, and the caller logs or persists that backup cleanup was incomplete.. The PR adds meaningful unit and composed tests, but the changed runtime/sandbox/infrastructure behavior still depends on real Docker daemon semantics, OpenShell supervisor timing, container rollback side effects, production image staging, root/non-root file ownership, and /tmp trust-boundary permissions.
  • **Runtime validation** — Unit behavior: `waitForOpenShellSupervisorReconnect` sees 14 consecutive Error polls followed by Ready at poll 15 and succeeds with the default K=15.. The PR adds meaningful unit and composed tests, but the changed runtime/sandbox/infrastructure behavior still depends on real Docker daemon semantics, OpenShell supervisor timing, container rollback side effects, production image staging, root/non-root file ownership, and /tmp trust-boundary permissions.
  • **Runtime validation is still important for the changed sandbox lifecycle paths** — Add or identify targeted runtime validation for the behavior-specific cases listed in testDepth.suggestedTests, focused on actual Docker/OpenShell and ownership behavior rather than external job status.
  • **Acceptance clause:** [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 Description: "the GPU-patched container starts with `--gpus all` OK and shows `Status=running`, `ExitCode=0`, `Health=starting`." — add test evidence or identify existing coverage. The PR raises the generic Error-phase debounce to 15 polls and tests transient Error recovery, but waitForOpenShellSupervisorReconnect does not inspect Docker State.Status, ExitCode, or Health.Status.
  • **Acceptance clause:** [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 Expected Result: "After GPU patch recreates the sandbox container with `--gpus all`, OpenShell supervisor should reconnect to the new container, the sandbox phase should transition to `Ready`, and onboard should proceed to `[7/8]` / `[8/8]`." — add test evidence or identify existing coverage. createDockerGpuSandboxCreatePatch waits for supervisor reconnect and finalizes the backup after `supervisorReady=true`, with composed tests for that flow. Issue comments report PR fix(onboard): harden GPU patch reconnect and rc cleanup against non-fatal failures #4777 reaching `[7/8]` and `[8/8]`, but there is no in-repo test that drives the full onboard progression through those phases.
Since last review details

Current findings:

  • Runtime validation is still important for the changed sandbox lifecycle paths: The PR adds useful unit and composed tests, including rc cleanup fixture tests and mocked Docker rollback/finalize tests. However, the changed behavior depends on real Docker container state, OpenShell supervisor reconnect timing, rollback side effects, production image helper precedence, root/non-root file ownership, and /tmp trust-boundary permissions. Those interactions are security-sensitive and not fully proven by mocks or extracted shell snippets.
    • Recommendation: Add or identify targeted runtime validation for the behavior-specific cases listed in testDepth.suggestedTests, focused on actual Docker/OpenShell and ownership behavior rather than external job status.
    • Evidence: Changed surfaces include scripts/nemoclaw-start.sh, scripts/lib/clean_runtime_shell_env_shim.py, Dockerfile, src/lib/onboard/docker-gpu-patch-finalize.ts, src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-sandbox-create.ts, and src/lib/onboard/docker-gpu-supervisor-reconnect.ts. Tests cover helper and rollback contracts mostly through fixtures/mocks.
  • [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 Health=starting retry clause is addressed by an alternate design (src/lib/onboard/docker-gpu-supervisor-reconnect.ts:39): One [WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664 suggested fix says to retry reconnect when the patched Docker container is `Status=running` plus `Health=starting`. This PR intentionally implements a generic Error-phase debounce plus rollback path instead; that may be acceptable, but the reconnect wait still does not inspect or test the exact Docker State/Health condition from the issue.
  • Deferred create success path still drops backup cleanup failure status (src/lib/onboard/docker-gpu-sandbox-create.ts:184): The helper now returns `backupRemoved=false` when backup removal fails after a successful supervisor reconnect, but the deferred create caller returns immediately on `supervisorReady` without surfacing or logging that incomplete cleanup state. The sandbox is healthy, so this is not fatal, but it can leave a leaked backup container without a clear diagnostic from this path.
    • Recommendation: Consider logging or otherwise preserving `finalizeOutcome.backupRemoved=false` on the `supervisorReady=true` path, and add a behavior test where the deferred create flow succeeds but backup `docker rm` fails.
    • Evidence: finalizeDockerGpuPatchBackup returns `{ backupRemoved: false, rolledBack: false }` when docker rm fails, while waitForSupervisorReconnectIfNeeded calls finalizeBackup and then immediately returns when `supervisorReady` is true.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@laitingsheng laitingsheng added platform: ubuntu Affects Ubuntu Linux environments bug-fix PR fixes a bug or regression area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery and removed platform: ubuntu Affects Ubuntu Linux environments labels Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 896-903: The rollback currently treats dockerStart as
fire-and-forget and always returns true; update the rollback to capture the
result of d.dockerStart(refs.originalName, containerOpts), check it with
isZeroStatus (like the restored variable), and return false if dockerStart fails
(non-zero), otherwise return true; reference d.dockerRename/restored,
isZeroStatus, d.dockerStart, refs.backupContainerName and refs.originalName so
the change is applied to the same rollback flow and ensure any caller sees the
actual start failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8c35031f-d881-49ad-b95c-b752b4a841f6

📥 Commits

Reviewing files that changed from the base of the PR and between 949dde7 and 8f580c2.

📒 Files selected for processing (7)
  • scripts/nemoclaw-start.sh
  • src/lib/adapters/docker/container.ts
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts
  • test/nemoclaw-start.test.ts

Comment thread src/lib/onboard/docker-gpu-patch.ts Outdated
…ainer

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26972019075
Target ref: bf9577bd8a47131c80bd9d6dab412ae16622a62a
Workflow ref: main
Requested jobs: gpu-e2e,openclaw-onboard-security-posture-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ✅ success

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-sandbox-create.ts`:
- Around line 157-161: Update the deferred finalize error message to explicitly
indicate rollback failure when finalizeOutcome.rolledBack is false: locate the
new Error(...) construction that uses finalizeOutcome.rolledBack and replace the
generic "did not reconnect" message in the rolledBack === false branch with a
clear message such as "OpenShell supervisor did not reconnect to the GPU-enabled
container and rollback failed; pre-patch sandbox was NOT restored" (or
equivalent wording) so operators can distinguish reconnect-only failures from
failed rollbacks; keep the rolledBack === true message unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 036a97c9-da7c-4dd1-a309-28633f77f5a0

📥 Commits

Reviewing files that changed from the base of the PR and between 8f580c2 and 73a053a.

📒 Files selected for processing (3)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-sandbox-create.ts

Comment thread src/lib/onboard/docker-gpu-sandbox-create.ts Outdated
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26991352191
Target ref: 97c77f53e07779c44d706469d055a0b908708f59
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e,openclaw-onboard-security-posture-e2e
Summary: 1 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ✅ success

@laitingsheng laitingsheng added v0.0.59 Release target v0.0.60 Release target and removed v0.0.59 Release target labels Jun 5, 2026
@cv

cv commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

@laitingsheng can you address the findings and suggestions in #4777 (comment), please? Otherwise, looking great!

…ocused module

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…oot rc ownership mismatch is fixture-tested

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Comment thread scripts/lib/clean_runtime_shell_env_shim.py Fixed
Comment thread scripts/lib/clean_runtime_shell_env_shim.py Fixed
Comment thread test/clean-runtime-shell-env-shim.test.ts Fixed
Comment thread test/clean-runtime-shell-env-shim.test.ts Fixed
Comment thread test/clean-runtime-shell-env-shim.test.ts Fixed
Comment thread test/clean-runtime-shell-env-shim.test.ts Fixed
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27003399558
Target ref: e8b94a91a39d843217054248a4c9edf0b08c8029
Workflow ref: main
Requested jobs: gpu-e2e,openclaw-onboard-security-posture-e2e
Summary: 0 passed, 1 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ❌ failure

Failed jobs: openclaw-onboard-security-posture-e2e. Check run artifacts for logs.

… it against CodeQL findings

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng removed the v0.0.60 Release target label Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27005449457
Target ref: e37e7e18fe3a4ee408be756a0d867c754084a050
Workflow ref: main
Requested jobs: gpu-e2e,openclaw-onboard-security-posture-e2e
Summary: 0 passed, 1 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ❌ failure

Failed jobs: openclaw-onboard-security-posture-e2e. Check run artifacts for logs.

…invariant, trim docker-gpu-patch test hotspot

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27006820492
Target ref: 6647bf25e4db2081dc13dff303c21838c92b1d9e
Workflow ref: main
Requested jobs: gpu-e2e,openclaw-onboard-security-posture-e2e
Summary: 0 passed, 1 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ❌ failure

Failed jobs: openclaw-onboard-security-posture-e2e. Check run artifacts for logs.

…inst advisor findings

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…nd drop issue refs from comments

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…d context

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27013516426
Target ref: fix/4664-4713-gpu-patch-hardening
Requested jobs: openclaw-onboard-security-posture-e2e,gpu-e2e,gpu-double-onboard-e2e
Summary: 1 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ✅ success

@laitingsheng laitingsheng added the v0.0.60 Release target label Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27013613190
Target ref: de04db9e067397c0d30b2e1ddfd160cd2ada3fa5
Workflow ref: main
Requested jobs: cloud-e2e,openclaw-onboard-security-posture-e2e,gpu-e2e
Summary: 2 passed, 0 failed, 1 skipped

Job Result
cloud-e2e ✅ success
gpu-e2e ⏭️ skipped
openclaw-onboard-security-posture-e2e ✅ success

@prekshivyas prekshivyas self-assigned this Jun 5, 2026
@cv cv merged commit 43efeba into main Jun 5, 2026
97 of 98 checks passed
@cv cv deleted the fix/4664-4713-gpu-patch-hardening branch June 5, 2026 16:02

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed against the current head (de04db9) — refreshing the earlier approval, which predated a substantial rework. Verified the substantive logic:

  • Rollback path (docker-gpu-patch-finalize.ts): reconcileSupervisorReconnect runs the reconnect probe before removing the backup, so a failed reconnect rolls back instead of leaving the deleted-backup/failed-new state from #4664. rollbackToBackupContainer ordering is correct — stop+rm the new container (freeing the original name), rename backup→original (bail with false if that fails), start it — and the thrown error honestly reports whether the pre-patch sandbox was restored. On the healthy path, backup removal is best-effort with the real rm status surfaced for diagnostics.
  • rc-shim cleanup (clean_runtime_shell_env_shim.py): the ownership guard if uid != 0 && st.st_uid != uid → skip(exit 0) is correct — it only leaves the legacy stanza when it genuinely cannot legally rewrite (not root, not owner), fixing the #4713 errexit crash. The script is otherwise hardened: O_NOFOLLOW (symlink refusal), regular-file check, post-clean verification that the shim is actually gone (exit 1 if it survives a rewritable case), TOCTOU-safe /proc/self/fd reopen with same_file re-validation, and original-mode preservation. The threat-model note (leftover shim only sources the mode-444 proxy-env file) holds.
  • Shipping: the extracted script is staged via build-context.ts, COPY'd to /usr/local/lib/nemoclaw/ in the Dockerfile, and invoked from there by nemoclaw-start.sh (with a source-relative fallback for tests) — no missing-file risk.

The debounce bump (5→15 polls) is a straightforward tolerance increase. Tests cover the rollback success/restore-failure paths, the debounce default, and both rc-cleanup branches. CI green. Good to merge.

miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* DeepSeek-V4-Flash now available as default inference model for DGX
Station.
* Hermes dashboard improved with dedicated port and OAuth-authenticated
tool gateway selection.
* Added weather and public-reference policy presets for expanded agent
capabilities.
* Enhanced Ollama model selection with GPU memory filtering and
automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
* Better sandbox cleanup scoping by port to prevent conflicts across
deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
* Expanded troubleshooting guides for inference, security, and sandbox
lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.60 Release target

Projects

None yet

5 participants