5 changes: 5 additions & 0 deletions docs/reference/troubleshooting.mdx
@@ -1254,6 +1254,11 @@ Fix the NVIDIA Container Toolkit or CDI configuration reported in the diagnostic
If you do not need GPU access inside the sandbox, rerun with `--no-sandbox-gpu`.
Set `NEMOCLAW_DOCKER_GPU_PATCH=0` only when you need to bypass this compatibility path during troubleshooting.

If onboarding reports `OpenShell supervisor did not reconnect to the GPU-enabled container.` even though the diagnostic bundle shows the patched container is running and healthy, the supervisor-reconnect wait is treating a transient Error phase (reported while the OpenShell host re-registers the new container) as fatal.
The reconnect wait debounces consecutive Error-phase polls before fast-failing; the default is five consecutive polls, about 10 seconds in total.
Increase the debounce window with `NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE` if your host needs more time to re-register the patched container, for example on slow WSL2 + Docker Desktop setups.
Set it to a higher integer such as `15` (about 30 seconds) and rerun onboarding; the value is clamped to a minimum of `1`.
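
As a rough illustration of the arithmetic above, the following TypeScript sketch shows how a poll count could be resolved from the environment and turned into an approximate window. The helper names (`resolveDebouncePolls`, `approxDebounceWindowSecs`), the fallback to the default on unparsable input, and the 2-second poll interval are assumptions inferred from the figures in the text, not NemoClaw's actual implementation; only the variable name, the default of five, and the minimum of `1` come from the documentation.

```typescript
// Hypothetical sketch: only the env variable name, the default of 5 polls,
// and the minimum of 1 are taken from the documentation.
const DEBOUNCE_ENV = "NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE";
const DEFAULT_DEBOUNCE_POLLS = 5;
// Assumption: polls are about 2 seconds apart, which matches the documented
// "five polls, about 10 seconds" and "15 polls, about 30 seconds" figures.
const ASSUMED_POLL_INTERVAL_SECS = 2;

type Env = Record<string, string | undefined>;

// Resolve the number of consecutive Error-phase polls to tolerate.
function resolveDebouncePolls(env: Env): number {
  const raw = env[DEBOUNCE_ENV];
  if (raw === undefined || raw.trim() === "") return DEFAULT_DEBOUNCE_POLLS;
  const parsed = Number.parseInt(raw, 10);
  // Assumption: unparsable values fall back to the default.
  if (!Number.isFinite(parsed)) return DEFAULT_DEBOUNCE_POLLS;
  // Values below 1 are clamped to 1.
  return Math.max(1, parsed);
}

// Approximate how long the debounce window lasts for a given poll count.
function approxDebounceWindowSecs(polls: number): number {
  return polls * ASSUMED_POLL_INTERVAL_SECS;
}
```

Under these assumptions, setting the variable to `15` resolves to 15 polls, which `approxDebounceWindowSecs` maps to roughly 30 seconds, and a value of `0` resolves to 1.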

### `pip install` fails with a system-packages error

Recent Ubuntu releases (including DGX Spark's Ubuntu 24.04) mark the system Python install as externally managed, so `pip install` without a virtual environment fails.
5 changes: 5 additions & 0 deletions skills/nemoclaw-user-reference/references/troubleshooting.md
@@ -1244,6 +1244,11 @@ Fix the NVIDIA Container Toolkit or CDI configuration reported in the diagnostic
If you do not need GPU access inside the sandbox, rerun with `--no-sandbox-gpu`.
Set `NEMOCLAW_DOCKER_GPU_PATCH=0` only when you need to bypass this compatibility path during troubleshooting.

If onboarding reports `OpenShell supervisor did not reconnect to the GPU-enabled container.` even though the diagnostic bundle shows the patched container is running and healthy, the supervisor-reconnect wait is treating a transient Error phase (reported while the OpenShell host re-registers the new container) as fatal.
The reconnect wait debounces consecutive Error-phase polls before fast-failing; the default is five consecutive polls, about 10 seconds in total.
Increase the debounce window with `NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE` if your host needs more time to re-register the patched container, for example on slow WSL2 + Docker Desktop setups.
Set it to a higher integer such as `15` (about 30 seconds) and rerun onboarding; the value is clamped to a minimum of `1`.

### `pip install` fails with a system-packages error

Recent Ubuntu releases (including DGX Spark's Ubuntu 24.04) mark the system Python install as externally managed, so `pip install` without a virtual environment fails.
8 changes: 6 additions & 2 deletions src/lib/onboard/docker-gpu-patch.test.ts
@@ -837,7 +837,10 @@ describe("docker-gpu-patch Error-phase diagnostics (#4316)", () => {
it("short-circuits the supervisor-reconnect wait when the sandbox enters Error phase", () => {
// Without the short-circuit, a patched container that crashes on startup
// leaves users waiting the full 900s+ supervisor-reconnect timeout before
// any Error-phase diagnostics run (#4316).
// any Error-phase diagnostics run. With the debounce now in place, this
// test asserts the K=1 (no-debounce) behavior explicitly so the original
// fast-fail intent is preserved when the operator opts out of the
// debounce.
const runOpenshell = vi.fn(() => ({ status: 1, stderr: "sandbox not ready" }));
const listOutputs = [
"alpha Provisioning 1s ago",
@@ -853,10 +856,11 @@ describe("docker-gpu-patch Error-phase diagnostics (#4316)", () => {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: 1,
});

expect(ok).toBe(false);
// Without short-circuit we'd loop ~300 iterations. With it, the second
// Without short-circuit we'd loop ~300 iterations. With K=1 the second
// iteration's list output shows Error and the wait bails out.
expect(runOpenshell).toHaveBeenCalledTimes(2);
expect(sleep).toHaveBeenCalledTimes(1);
93 changes: 22 additions & 71 deletions src/lib/onboard/docker-gpu-patch.ts
@@ -14,7 +14,22 @@ import {
dockerRunDetached,
dockerStop,
} from "../adapters/docker";
import { envInt } from "./env";
import {
type DockerGpuSupervisorReconnectDeps,
DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE_ENV,
DOCKER_GPU_SUPERVISOR_RECONNECT_TIMEOUT_ENV,
getDockerGpuSupervisorReconnectErrorDebouncePolls,
getDockerGpuSupervisorReconnectTimeoutSecs,
waitForOpenShellSupervisorReconnect,
} from "./docker-gpu-supervisor-reconnect";
export {
DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE_ENV,
DOCKER_GPU_SUPERVISOR_RECONNECT_TIMEOUT_ENV,
getDockerGpuSupervisorReconnectErrorDebouncePolls,
getDockerGpuSupervisorReconnectTimeoutSecs,
waitForOpenShellSupervisorReconnect,
};
export type { DockerGpuSupervisorReconnectDeps };

export const OPENSHELL_MANAGED_BY_LABEL = "openshell.ai/managed-by";
export const OPENSHELL_MANAGED_BY_VALUE = "openshell";
@@ -23,9 +38,6 @@ const OPENSHELL_SANDBOX_COMMAND_ENV = "OPENSHELL_SANDBOX_COMMAND";

const DOCKER_GPU_PATCH_TIMEOUT_MS = 30_000;
const DOCKER_GPU_PATCH_WAIT_SECS = 180;
const DOCKER_GPU_SUPERVISOR_RECONNECT_MIN_SECS = 900;
export const DOCKER_GPU_SUPERVISOR_RECONNECT_TIMEOUT_ENV =
"NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_TIMEOUT";
export const DOCKER_GPU_PATCH_NETWORK_ENV = "NEMOCLAW_DOCKER_GPU_PATCH_NETWORK";
const MAX_DOCKER_CONTAINER_NAME_LENGTH = 253;
const GPU_ENV_KEYS = new Set([
@@ -70,6 +82,11 @@ export type DockerGpuPatchDeps = {
readDir?: (dirPath: string) => string[] | null;
/** Injectable file reader for unit testing CDI spec content checks. */
readFile?: (filePath: string) => string | null;
/**
* Forwarded to the supervisor-reconnect wait. See
* `DockerGpuSupervisorReconnectDeps.errorPhaseDebouncePolls`.
*/
errorPhaseDebouncePolls?: number;
};

export type DockerGpuPatchModeKind = "gpus" | "nvidia-runtime" | "cdi";
@@ -833,72 +850,6 @@ function waitForNewContainerId(
return null;
}

function sandboxListShowsErrorPhase(
sandboxName: string,
runCaptureOpenshell: NonNullable<DockerGpuPatchDeps["runCaptureOpenshell"]>,
): boolean {
try {
const list = runCaptureOpenshell(["sandbox", "list"], {
ignoreError: true,
suppressOutput: true,
timeout: DOCKER_GPU_PATCH_TIMEOUT_MS,
});
return SANDBOX_FAILURE_PHASE_TOKENS.has(
parseSandboxPhaseFromListOutput(list, sandboxName) ?? "",
);
} catch {
return false;
}
}

function waitForOpenShellSandboxExec(
sandboxName: string,
timeoutSecs: number,
deps: DockerGpuPatchDeps,
): boolean {
if (!deps.runOpenshell) return true;
const d = depsWithDefaults(deps);
const deadline = Date.now() + Math.max(1, timeoutSecs) * 1000;
while (Date.now() <= deadline) {
const result = deps.runOpenshell(
["sandbox", "exec", "-n", sandboxName, "--", "true"],
{ ignoreError: true, suppressOutput: true, timeout: DOCKER_GPU_PATCH_TIMEOUT_MS },
);
if (isZeroStatus(result)) return true;
// Short-circuit the supervisor-reconnect wait when the sandbox enters a
// terminal failure phase. Without this, a patched container that exits
// on startup leaves the user staring at the supervisor-reconnect
// timeout (default 900s) before any Error-phase diagnostics run (#4316).
if (
deps.runCaptureOpenshell &&
sandboxListShowsErrorPhase(sandboxName, deps.runCaptureOpenshell)
) {
return false;
}
d.sleep(2);
}
return false;
}

export const waitForOpenShellSupervisorReconnect = waitForOpenShellSandboxExec;

export function getDockerGpuSupervisorReconnectTimeoutSecs(
sandboxReadyTimeoutSecs: number,
env: Record<string, string | undefined> = process.env,
): number {
const readyTimeoutSecs = Number.isFinite(sandboxReadyTimeoutSecs)
? Math.max(1, Math.round(sandboxReadyTimeoutSecs))
: 1;
const fallback = Math.max(
readyTimeoutSecs,
DOCKER_GPU_SUPERVISOR_RECONNECT_MIN_SECS,
);
return Math.max(
1,
envInt(DOCKER_GPU_SUPERVISOR_RECONNECT_TIMEOUT_ENV, fallback, env),
);
}

function decoratePatchError<T extends Error>(
error: T,
context: DockerGpuPatchFailureContext,
@@ -1017,7 +968,7 @@ export function recreateOpenShellDockerSandboxWithGpu(
});

if (options.waitForSupervisor !== false) {
const execReady = waitForOpenShellSandboxExec(
const execReady = waitForOpenShellSupervisorReconnect(
options.sandboxName,
options.timeoutSecs ?? DOCKER_GPU_PATCH_WAIT_SECS,
deps,
159 changes: 159 additions & 0 deletions src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
@@ -0,0 +1,159 @@
// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

import { describe, expect, it, vi } from "vitest";

import {
getDockerGpuSupervisorReconnectErrorDebouncePolls,
waitForOpenShellSupervisorReconnect,
} from "../../../dist/lib/onboard/docker-gpu-supervisor-reconnect";

// The Docker GPU patch supervisor-reconnect wait must absorb a transient
// Error phase reported while OpenShell's sandbox-list cache catches up to
// the newly-recreated GPU container. The old-container teardown briefly
// marks the row Error before the host re-registers the new container.
// Without debouncing, the fast-fail short-circuits within ~12s on a healthy
// GPU sandbox whose container is running and whose supervisor has already
// logged `LIFECYCLE:INSTALL OpenShell Sandbox Supervisor success`.
describe("docker-gpu-supervisor-reconnect Error-phase debounce", () => {
it("absorbs a transient Error phase shorter than the debounce window", () => {
const execOutputs = [
{ status: 1, stderr: "sandbox not ready" },
{ status: 1, stderr: "sandbox not ready" },
{ status: 1, stderr: "sandbox not ready" },
{ status: 0, stdout: "" },
];
let execIdx = 0;
const runOpenshell = vi.fn(
() => execOutputs[Math.min(execIdx++, execOutputs.length - 1)],
);
const listOutputs = [
"alpha Error 1s ago",
"alpha Error 3s ago",
"alpha Provisioning 5s ago",
"alpha Ready 7s ago",
];
let listIdx = 0;
const runCaptureOpenshell = vi.fn(
() => listOutputs[Math.min(listIdx++, listOutputs.length - 1)],
);
const sleep = vi.fn();

const ok = waitForOpenShellSupervisorReconnect("alpha", 600, {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: 5,
});

expect(ok).toBe(true);
expect(runOpenshell).toHaveBeenCalledTimes(4);
});

it("still fast-fails when Error phase persists for the full debounce window", () => {
const runOpenshell = vi.fn(() => ({ status: 1, stderr: "sandbox not ready" }));
const runCaptureOpenshell = vi.fn(() => "alpha Error 1s ago");
const sleep = vi.fn();

const ok = waitForOpenShellSupervisorReconnect("alpha", 600, {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: 3,
});

expect(ok).toBe(false);
// Three consecutive Error polls trigger the short-circuit on poll 3.
// Sleeps happen only between polls 1->2 and 2->3, so two sleeps total.
expect(runOpenshell).toHaveBeenCalledTimes(3);
expect(sleep).toHaveBeenCalledTimes(2);
});

it("resets the consecutive-Error counter when the phase recovers", () => {
// Error, Error, Provisioning (counter resets), Error, Error, Error
// -> bails out on the 3rd post-recovery Error, not earlier.
const runOpenshell = vi.fn(() => ({ status: 1, stderr: "sandbox not ready" }));
const listOutputs = [
"alpha Error 1s ago",
"alpha Error 3s ago",
"alpha Provisioning 5s ago",
"alpha Error 7s ago",
"alpha Error 9s ago",
"alpha Error 11s ago",
];
let listIdx = 0;
const runCaptureOpenshell = vi.fn(
() => listOutputs[Math.min(listIdx++, listOutputs.length - 1)],
);
const sleep = vi.fn();

const ok = waitForOpenShellSupervisorReconnect("alpha", 600, {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: 3,
});

expect(ok).toBe(false);
expect(runOpenshell).toHaveBeenCalledTimes(6);
});

it("defaults the debounce to 5 polls and honors the env override", () => {
expect(getDockerGpuSupervisorReconnectErrorDebouncePolls({})).toBe(5);
expect(
getDockerGpuSupervisorReconnectErrorDebouncePolls({
NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE: "2",
}),
).toBe(2);
// Non-positive values are clamped to a minimum of 1.
expect(
getDockerGpuSupervisorReconnectErrorDebouncePolls({
NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE: "0",
}),
).toBe(1);
});

it("clamps an injected debounce override to the same minimum as the env path", () => {
// 0 / negative / fractional overrides must not bypass the ≥1 contract that
// the env-backed helper enforces.
const runOpenshell = vi.fn(() => ({ status: 1, stderr: "sandbox not ready" }));
const runCaptureOpenshell = vi.fn(() => "alpha Error 1s ago");
const sleep = vi.fn();

const ok = waitForOpenShellSupervisorReconnect("alpha", 600, {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: 0,
});

expect(ok).toBe(false);
// Clamped to K=1: first Error poll short-circuits with no preceding sleep.
expect(runOpenshell).toHaveBeenCalledTimes(1);
expect(sleep).not.toHaveBeenCalled();
});

it("falls back to the env-backed default when an injected override is non-finite", () => {
// NaN / +Infinity / -Infinity overrides must not silently neutralise the
// fast-fail loop. A comparison against NaN is always false, and the
// consecutive-Error counter can never reach `Infinity`, so either value
// would leave the wait to burn the full timeout window.
for (const bogus of [Number.NaN, Number.POSITIVE_INFINITY, Number.NEGATIVE_INFINITY]) {
const runOpenshell = vi.fn(() => ({ status: 1, stderr: "sandbox not ready" }));
const runCaptureOpenshell = vi.fn(() => "alpha Error 1s ago");
const sleep = vi.fn();

const ok = waitForOpenShellSupervisorReconnect("alpha", 600, {
runOpenshell,
runCaptureOpenshell,
sleep,
errorPhaseDebouncePolls: bogus,
});

expect(ok).toBe(false);
// Default K=5 from the env-backed helper: 5 polls + 4 sleeps before fast-fail.
expect(runOpenshell).toHaveBeenCalledTimes(5);
expect(sleep).toHaveBeenCalledTimes(4);
}
});
});
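
The tests above pin down the observable behavior of the debounce: a streak of Error polls shorter than the window is absorbed, any non-Error phase resets the streak, and out-of-range overrides are normalized. A minimal sketch of a loop that would satisfy those expectations might look like the following. The names `waitForReconnect`, `phaseOf`, `normalizeDebouncePolls`, and the `ReconnectDeps` shape are hypothetical, as are the row format assumed by the parser and the use of `Math.floor` for fractional values; the real `waitForOpenShellSupervisorReconnect` lives in `docker-gpu-supervisor-reconnect.ts`, whose source is not shown in this diff.

```typescript
// Hypothetical dependency shape, loosely modeled on the injected test doubles.
type ReconnectDeps = {
  exec: () => { status: number | null }; // probe, e.g. a no-op command in the sandbox
  list: () => string; // raw sandbox-list text
  sleep: (secs: number) => void;
  now: () => number; // milliseconds, injectable for tests
  errorPhaseDebouncePolls?: number;
};

const DEFAULT_DEBOUNCE_POLLS = 5;
const POLL_INTERVAL_SECS = 2;

// Non-finite overrides fall back to the default; everything else is
// floored and clamped to at least 1 (flooring is an assumption).
function normalizeDebouncePolls(value: number | undefined): number {
  if (value === undefined || !Number.isFinite(value)) return DEFAULT_DEBOUNCE_POLLS;
  return Math.max(1, Math.floor(value));
}

// Assumption: each row looks like "<name> <phase> <age...>".
function phaseOf(listOutput: string, sandboxName: string): string | null {
  for (const line of listOutput.split("\n")) {
    const [name, phase] = line.trim().split(/\s+/);
    if (name === sandboxName && phase) return phase;
  }
  return null;
}

function waitForReconnect(
  sandboxName: string,
  timeoutSecs: number,
  deps: ReconnectDeps,
): boolean {
  const debouncePolls = normalizeDebouncePolls(deps.errorPhaseDebouncePolls);
  const deadline = deps.now() + Math.max(1, timeoutSecs) * 1000;
  let consecutiveErrors = 0;
  while (deps.now() <= deadline) {
    if (deps.exec().status === 0) return true;
    if (phaseOf(deps.list(), sandboxName) === "Error") {
      consecutiveErrors += 1;
      // Fast-fail only once the Error phase has persisted for the whole window.
      if (consecutiveErrors >= debouncePolls) return false;
    } else {
      // Any other phase counts as recovery and resets the streak.
      consecutiveErrors = 0;
    }
    deps.sleep(POLL_INTERVAL_SECS);
  }
  return false;
}
```

Because the counter is checked before the sleep, a window of K polls costs K probes and K - 1 sleeps, which is the same accounting the assertions above use for K=3 and K=5.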