10 changes: 8 additions & 2 deletions docs/reference/commands.mdx
@@ -402,9 +402,10 @@ $ nemoclaw my-assistant recover
Show sandbox status, health, and inference configuration.

Pass `--json` to emit a structured per-sandbox report instead of the text renderer.
-The JSON output includes at least `schemaVersion`, `name`, `found`, `model`, `provider`, `phase`, `gatewayState`, `inferenceHealth`, `rpcIssue`, `hostGpuDetected`, `sandboxGpuEnabled`, `sandboxGpuMode`, `sandboxGpuDevice`, `openshellDriver`, `openshellVersion`, and `policies`.
+The JSON output includes at least `schemaVersion`, `name`, `found`, `model`, `provider`, `phase`, `gatewayState`, `inferenceHealth`, `rpcIssue`, `hostGpuDetected`, `sandboxGpuEnabled`, `sandboxGpuMode`, `sandboxGpuDevice`, `openshellDriver`, `openshellVersion`, `policies`, `failureLayer`, and `dockerPaused`.
`openshellDriver` and `openshellVersion` are always strings (falling back to `"unknown"` when the registry has no value), so consumers can rely on `typeof` checks.
-The command exits non-zero when the sandbox is missing locally, the gateway state is not `present`, or the gateway reports a schema/protobuf mismatch (mirrored as `rpcIssue`).
+`failureLayer` is `null` when no preflight failure was detected and otherwise one of `docker_unreachable`, `sandbox_container_stopped`, or `sandbox_dashboard_port_conflict`; when set, `inferenceHealth` is suppressed to `null` so automation does not see a stale remote-provider healthy status during a local outage.
+The command exits non-zero when the sandbox is missing locally, the gateway state is not `present`, the gateway reports a schema/protobuf mismatch (mirrored as `rpcIssue`), or `failureLayer` is non-null.
The alias form `nemoclaw <name> status --json` requires the sandbox to be registered locally; the canonical form `nemoclaw sandbox status <name> --json` is the one to use from automation that may run against an unknown sandbox name, since it still emits a JSON document with `found: false` instead of a text error.
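
As an illustration of how automation might consume this report, here is a minimal sketch. Only the field names come from the text above; the `StatusReport` value types, the `Verdict` shape, and the `interpretStatusReport` helper are hypothetical and are not part of the CLI. The checks mirror the documented exit conditions.

```typescript
// Hypothetical consumer of `nemoclaw sandbox status <name> --json` output.
// Field names are taken from the documentation; the value types are
// assumptions made for this sketch.
type StatusReport = {
  found: boolean;
  gatewayState: string;
  rpcIssue: unknown;
  failureLayer: string | null;
  inferenceHealth: unknown;
};

type Verdict = { ok: boolean; reason: string | null };

function interpretStatusReport(jsonText: string): Verdict {
  const report = JSON.parse(jsonText) as Partial<StatusReport>;
  // `found: false` is emitted by the canonical form for unknown sandboxes.
  if (!report.found) return { ok: false, reason: "sandbox not found" };
  // A non-null failure layer means a local outage; `inferenceHealth` is
  // documented to be null in that case, so it is not consulted here.
  if (report.failureLayer) {
    return { ok: false, reason: `failure layer: ${report.failureLayer}` };
  }
  if (report.gatewayState !== "present") {
    return { ok: false, reason: `gateway state: ${String(report.gatewayState)}` };
  }
  if (report.rpcIssue) {
    return { ok: false, reason: "gateway schema/protobuf mismatch" };
  }
  return { ok: true, reason: null };
}
```

Checking `failureLayer` before `gatewayState` is a choice made for this sketch so that the most local cause is reported first; the CLI itself only documents that any of these conditions produces a non-zero exit.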

```console
@@ -431,6 +432,11 @@ Use that line to distinguish a healthy backend from a broken proxy path that the

For cloud-only providers, the output omits the NIM status line unless a NIM container is registered or an unexpected NIM container is running.

When the sandbox's recorded driver is `docker` and the host Docker daemon is not reachable, the command prints `Failure layer: docker_unreachable — Docker daemon is not reachable.` as the first line of stdout, suppresses the host-side `Inference` probe (which otherwise hits the remote provider directly and is misleading when the local stack is down), and exits with a non-zero status.

When the host Docker daemon is reachable but the per-sandbox container is stopped, the command prints `Failure layer: sandbox_container_stopped — sandbox container exists but is not running.` as the first line of stdout, suppresses the host-side `Inference` probe, and exits with a non-zero status.
If the sandbox's recorded dashboard port is also held by a foreign listener, the header escalates to `Failure layer: sandbox_dashboard_port_conflict — sandbox container is stopped and the dashboard port is held by a foreign listener.` so the operator can recover the port before restarting the sandbox.

If the sandbox or gateway cannot be verified, the command exits non-zero instead of reporting healthy inference from stale registry state.
Gateway and dashboard health checks treat HTTP `401` from device auth as a live service, not as an offline gateway.
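
For scripts that read the text renderer rather than `--json`, the `Failure layer:` header described above can be picked off the first line of stdout. The following is a minimal sketch: `parseFailureLayer` and `KNOWN_LAYERS` are hypothetical names, and the only assumption about the output is the documented `Failure layer: <name>` prefix on the first line.

```typescript
// Layer names listed in the documentation for `status`. Any other value is
// returned as null in this sketch.
const KNOWN_LAYERS = [
  "docker_unreachable",
  "sandbox_container_stopped",
  "sandbox_dashboard_port_conflict",
] as const;

type KnownLayer = (typeof KNOWN_LAYERS)[number];

// Hypothetical helper (not part of the CLI): inspects only the first line of
// stdout, because the documentation says the header is printed there.
function parseFailureLayer(stdout: string): KnownLayer | null {
  const firstLine = stdout.split("\n", 1)[0] ?? "";
  const match = /^Failure layer: ([a-z_]+)/.exec(firstLine);
  if (!match) return null;
  const layer = match[1] ?? "";
  return KNOWN_LAYERS.find((known) => known === layer) ?? null;
}
```

Where possible, `--json` and the `failureLayer` field are the more robust interface, since the header wording is meant for operators.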

7 changes: 6 additions & 1 deletion src/commands/sandbox/status.ts
@@ -28,7 +28,12 @@ export default class SandboxStatusCommand extends NemoClawCommand {
const { args } = await this.parse(SandboxStatusCommand);
if (this.jsonEnabled()) {
const report = await getSandboxStatusReport(args.sandboxName);
-if (!report.found || report.gatewayState !== "present" || report.rpcIssue) {
+if (
+!report.found ||
+report.gatewayState !== "present" ||
+report.rpcIssue ||
+report.failureLayer
+) {
process.exitCode = 1;
}
// #4310: route the machine-readable report through the centralized
48 changes: 6 additions & 42 deletions src/lib/actions/sandbox/docker-health.ts
@@ -4,6 +4,7 @@
import { dockerContainerInspectFormat } from "../../adapters/docker/inspect";
import { dockerCapture } from "../../adapters/docker/run";
import * as registry from "../../state/registry";
import { resolveSandboxContainerOwner } from "./sandbox-container-owner";

export type DockerHealthState =
| "healthy"
@@ -68,48 +69,11 @@ function resolveDockerDriverSandboxContainer(
} catch {
return null;
}
-// OpenShell names sandbox containers either as `openshell-<sandbox>`
-// (no suffix) or `openshell-<sandbox>-<id>` where <id> is a runtime
-// identifier appended by openshell. Two ambiguities to avoid:
-//
-// 1. A sandbox name can be a prefix of another sandbox name
-// (`my` vs `my-assistant`).
-// 2. Even with a hyphen-free suffix, a sandbox name can be a prefix
-// of another sandbox name whose own suffix is hyphen-free
-// (`my-assistant` vs `my-assistant-prod`).
-//
-// To disambiguate, resolve each candidate to the LONGEST registered
-// sandbox name it could belong to. We only accept a candidate when
-// that resolved owner is the sandbox we are looking up. This also
-// gives the right answer for the `openshell-<sandbox>` exact form.
-const ourPrefix = `openshell-${sandboxName}-`;
-const ourExact = `openshell-${sandboxName}`;
-const knownSandboxes = new Set(deps.listSandboxNames());
-knownSandboxes.add(sandboxName);
-const candidates = deps
-.dockerPsNames()
-.split("\n")
-.map((line) => line.trim())
-.filter((line) => line === ourExact || line.startsWith(ourPrefix));
-// Prefer the exact-name container before considering suffixed ones.
-// Without this short-circuit, a suffixed `openshell-<name>-<id>` whose
-// <id> is a docker runtime suffix (not a registered sandbox name) would
-// resolve to our sandbox via the longest-match heuristic and beat the
-// co-existing exact `openshell-<name>` if it appeared earlier in
-// `docker ps`.
-if (candidates.includes(ourExact)) return ourExact;
-for (const candidate of candidates) {
-const stripped = candidate.replace(/^openshell-/, "");
-// Find the longest known sandbox whose container name pattern
-// matches this candidate. Longest-first defeats prefix collisions.
-const owner = [...knownSandboxes]
-.filter(
-(name) => stripped === name || stripped.startsWith(`${name}-`),
-)
-.sort((a, b) => b.length - a.length)[0];
-if (owner === sandboxName) return candidate;
-}
-return null;
+return resolveSandboxContainerOwner(
+deps.dockerPsNames(),
+sandboxName,
+deps.listSandboxNames(),
+);
}

function normalizeHealthState(raw: string): DockerHealthState {
110 changes: 104 additions & 6 deletions src/lib/actions/sandbox/gateway-failure-classifier.ts
@@ -8,6 +8,7 @@ import { dockerCapture } from "../../adapters/docker/run";
import { CLI_NAME } from "../../cli/branding";
import { GATEWAY_PORT } from "../../core/ports";
import * as registry from "../../state/registry";
import { resolveSandboxContainerOwner } from "./sandbox-container-owner";

const DEFAULT_CONTAINER = "openshell-cluster-nemoclaw";
const DOCKER_TIMEOUT_MS = 3000;
@@ -18,7 +19,9 @@ export type GatewayFailureLayer =
| "container_missing"
| "container_exited_port_conflict"
| "container_exited"
-| "gateway_unreachable";
+| "gateway_unreachable"
+| "sandbox_container_stopped"
+| "sandbox_dashboard_port_conflict";

export type GatewayFailureResult = {
layer: GatewayFailureLayer;
@@ -32,10 +35,30 @@ export type GatewayFailureRunners = {
portProbe: (port: number) => Promise<boolean>;
};

export type SandboxContainerFailureLayer =
| "sandbox_container_stopped"
| "sandbox_dashboard_port_conflict";

export type SandboxContainerFailureResult = {
layer: SandboxContainerFailureLayer;
detail: string;
};

export type SandboxContainerFailureRunners = {
listAllContainerNames: () => string;
listRunningContainerNames: () => string;
listSandboxNames: () => string[];
portProbe: (port: number) => Promise<boolean>;
};

function defaultDockerInfo(): boolean {
return dockerInfo({ ignoreError: true, timeout: DOCKER_TIMEOUT_MS }).length > 0;
}

export function isDockerDaemonReachable(): boolean {
return defaultDockerInfo();
}

function dockerContainerListed(container: string, allFlag: boolean): boolean {
const args = ["ps"];
if (allFlag) args.push("-a");
@@ -129,12 +152,85 @@ const LAYER_HEADERS: Record<GatewayFailureLayer, string> = {
container_exited: "Failure layer: container_exited — container exited.",
gateway_unreachable:
"Failure layer: gateway_unreachable — container running but gateway API unresponsive.",
sandbox_container_stopped:
"Failure layer: sandbox_container_stopped — sandbox container exists but is not running.",
sandbox_dashboard_port_conflict:
"Failure layer: sandbox_dashboard_port_conflict — sandbox container is stopped and the dashboard port is held by a foreign listener.",
};

export function getLayerHeader(layer: GatewayFailureLayer): string {
return LAYER_HEADERS[layer];
}

function defaultListAllContainerNames(): string {
return dockerCapture(["ps", "-a", "--format", "{{.Names}}"], {
ignoreError: true,
timeout: DOCKER_TIMEOUT_MS,
});
}

function defaultListRunningContainerNames(): string {
return dockerCapture(["ps", "--format", "{{.Names}}"], {
ignoreError: true,
timeout: DOCKER_TIMEOUT_MS,
});
}

function defaultListSandboxNames(): string[] {
try {
return registry.listSandboxes().sandboxes.map((entry) => entry.name);
} catch {
return [];
}
}

const defaultSandboxContainerRunners: SandboxContainerFailureRunners = {
listAllContainerNames: defaultListAllContainerNames,
listRunningContainerNames: defaultListRunningContainerNames,
listSandboxNames: defaultListSandboxNames,
portProbe: defaultPortProbe,
};

function isValidDashboardPort(port: number | null | undefined): port is number {
return (
typeof port === "number" && Number.isInteger(port) && port >= 1 && port <= 65535
);
}

export async function classifySandboxContainerFailure(
sandboxName: string,
opts: {
dashboardPort?: number | null;
runners?: SandboxContainerFailureRunners;
} = {},
): Promise<SandboxContainerFailureResult | null> {
const runners = opts.runners ?? defaultSandboxContainerRunners;
const registeredSandboxNames = runners.listSandboxNames();
const running = resolveSandboxContainerOwner(
runners.listRunningContainerNames(),
sandboxName,
registeredSandboxNames,
);
if (running) return null;
const present = resolveSandboxContainerOwner(
runners.listAllContainerNames(),
sandboxName,
registeredSandboxNames,
);
if (!present) return null;
const dashboardPort = opts.dashboardPort;
if (isValidDashboardPort(dashboardPort) && (await runners.portProbe(dashboardPort))) {
return {
layer: "sandbox_dashboard_port_conflict",
detail: `Sandbox container '${present}' is stopped and dashboard port ${dashboardPort} is held by another process.`,
};
}
return {
layer: "sandbox_container_stopped",
detail: `Sandbox container '${present}' exists but is not running.`,
};
}

type SandboxDriverLookup = (
name: string,
) => { openshellDriver?: string | null } | null | undefined;
@@ -160,7 +256,10 @@ const NON_DOCKER_DRIVERS = new Set(["vm"]);
* guidance on a Docker-less host; that is preferable to silently regressing
* every legacy Docker sandbox. (#4428)
*/
-function isDockerBackedSandbox(sandboxName: string, getSandbox: SandboxDriverLookup): boolean {
+function isDockerBackedSandbox(
+sandboxName: string,
+getSandbox: SandboxDriverLookup,
+): boolean {
const driver = getSandbox(sandboxName)?.openshellDriver;
return !(typeof driver === "string" && NON_DOCKER_DRIVERS.has(driver.toLowerCase()));
}
@@ -170,10 +269,9 @@ function isDockerBackedSandbox(sandboxName: string, getSandbox: SandboxDriverLoo
* `docker_unreachable` layer of {@link classifyGatewayFailure}). Sandbox
* commands use this as a fast preflight so a transient Docker daemon outage is
* classified as a host runtime problem rather than a stuck sandbox phase or a
-* connect timeout (#4428). Returns `false` for non-Docker drivers so VM/
-* Kubernetes sandboxes are never misclassified. `docker info` is a `spawnSync`
-* call, so this stays synchronous and can run from non-async call sites such
-* as `logs` and `policy-list`.
+* connect timeout (#4428). Returns `false` for VM sandboxes so they are never
+* misclassified. `docker info` is a `spawnSync` call, so this stays synchronous
+* and can run from non-async call sites such as `logs` and `policy-list`.
*/
export function isDockerRuntimeDown(
sandboxName: string,
46 changes: 46 additions & 0 deletions src/lib/actions/sandbox/sandbox-container-owner.ts
@@ -0,0 +1,46 @@
// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

/**
* Resolve which OpenShell container owns a given sandbox name.
*
* OpenShell names sandbox containers either as `openshell-<sandbox>` (no
* suffix) or `openshell-<sandbox>-<id>`, where `<id>` is appended by openshell
* at runtime. Two prefix collisions are possible:
*
* 1. A sandbox name can be a prefix of another sandbox name
* (`my` vs `my-assistant`).
* 2. Even with a hyphen-free `<id>`, a sandbox name can be a prefix
* of another sandbox name whose own suffix is hyphen-free
* (`my-assistant` vs `my-assistant-prod`).
*
* The longest-owner rule resolves each candidate to the longest registered
* sandbox name that could claim it, then only accepts candidates that resolve
* back to the queried sandbox. The exact-name form is preferred before
* suffixed forms so `openshell-<sandbox>` always wins over an unrelated
* `openshell-<sandbox>-<runtime-id>` co-tenant.
*/
export function resolveSandboxContainerOwner(
containerNamesRaw: string,
sandboxName: string,
registeredSandboxNames: Iterable<string>,
): string | null {
const ourPrefix = `openshell-${sandboxName}-`;
const ourExact = `openshell-${sandboxName}`;
const known = new Set<string>(registeredSandboxNames);
known.add(sandboxName);
const candidates = containerNamesRaw
.split("\n")
.map((line) => line.trim())
.filter((line) => line === ourExact || line.startsWith(ourPrefix));
if (candidates.includes(ourExact)) return ourExact;
const knownArr = [...known];
for (const candidate of candidates) {
const stripped = candidate.replace(/^openshell-/, "");
const owner = knownArr
.filter((name) => stripped === name || stripped.startsWith(`${name}-`))
.sort((a, b) => b.length - a.length)[0];
if (owner === sandboxName) return candidate;
}
return null;
}
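
To make the longest-owner rule concrete, the sketch below reproduces `resolveSandboxContainerOwner` from this file (re-indented, with `Array.from` in place of the spread, so the snippet runs on its own) and calls it on made-up container and sandbox names. The names `abc123`, `x1`, and the registry contents are illustrative assumptions, not values from the PR.

```typescript
// Copy of the function added in this diff, included only so the example is
// self-contained.
function resolveSandboxContainerOwner(
  containerNamesRaw: string,
  sandboxName: string,
  registeredSandboxNames: Iterable<string>,
): string | null {
  const ourPrefix = `openshell-${sandboxName}-`;
  const ourExact = `openshell-${sandboxName}`;
  const known = new Set<string>(registeredSandboxNames);
  known.add(sandboxName);
  const candidates = containerNamesRaw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line === ourExact || line.startsWith(ourPrefix));
  // The exact-name container wins over any suffixed form.
  if (candidates.includes(ourExact)) return ourExact;
  const knownArr = Array.from(known);
  for (const candidate of candidates) {
    const stripped = candidate.replace(/^openshell-/, "");
    // Longest registered name that could claim this container.
    const owner = knownArr
      .filter((name) => stripped === name || stripped.startsWith(`${name}-`))
      .sort((a, b) => b.length - a.length)[0];
    if (owner === sandboxName) return candidate;
  }
  return null;
}

// Illustrative input: two containers whose names both start with `openshell-my-`.
const containerNames = ["openshell-my-assistant-abc123", "openshell-my-x1"].join("\n");
const registered = ["my", "my-assistant"];

// `my` must not claim the container that belongs to `my-assistant`.
const ownerOfMy = resolveSandboxContainerOwner(containerNames, "my", registered);
const ownerOfAssistant = resolveSandboxContainerOwner(containerNames, "my-assistant", registered);

// A name that is both "<sandbox>-<id>" and an exact registered sandbox is
// attributed to the longer registered name, so the shorter query gets null.
const ambiguous = resolveSandboxContainerOwner(
  "openshell-my-assistant-prod",
  "my-assistant",
  ["my-assistant", "my-assistant-prod"],
);

console.log(ownerOfMy); // openshell-my-x1
console.log(ownerOfAssistant); // openshell-my-assistant-abc123
console.log(ambiguous); // null
```

One consequence visible in the last call: if a runtime `<id>` happens to equal the tail of another registered sandbox name, the container is attributed to the longer name, which is the conservative reading of the rule described in the doc comment.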