fix(cli): surface docker/sandbox container/dashboard port failure layers in sandbox status#4388
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds upfront preflight classification for Docker-driver sandboxes (daemon unreachable, container stopped, dashboard-port conflict), threads inference-probe suppression into snapshot collection, exposes ChangesDocker Unreachable Detection
Sequence Diagram(s)sequenceDiagram
participant CLI as showSandboxStatus
participant Preflight as classifySandboxStatusPreflightFailure
participant Docker as isDockerDaemonReachable
participant ContainerClassifier as classifySandboxContainerFailure
participant PS as dockerPsNames
participant Port as portProbe
CLI->>Preflight: ask preflight(sandbox)
Preflight->>Docker: check docker daemon reachability
alt docker unreachable
Docker-->>Preflight: unreachable -> docker_unreachable
else docker reachable
Preflight->>ContainerClassifier: classify sandbox container
ContainerClassifier->>PS: list container names
ContainerClassifier->>Port: probe dashboardPort (if valid)
Port-->>ContainerClassifier: port held / free
end
Preflight-->>CLI: preflight result (failureLayer or null)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 2 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/cli.test.ts (1)
1003-1052: ⚡ Quick winStrengthen the non-docker-driver assertion coverage.
This test currently proves only header absence, not that inference probing remains enabled. Please also assert normal (non-error) status behavior for this path.
Suggested diff
const r = runWithEnv("alpha status", { HOME: home, PATH: `${localBin}:${process.env.PATH || ""}`, }); + expect(r.code).toBe(0); + expect(r.out).toContain("Inference:"); expect(r.out).not.toContain( "Failure layer: docker_unreachable", );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/cli.test.ts` around lines 1003 - 1052, The test "sandbox <name> status preserves Inference probe when openshellDriver is not docker" currently only asserts the absence of "Failure layer: docker_unreachable"; update the test (around runWithEnv("alpha status") / writeSandboxRegistry) to also assert that inference probing and normal status output are present by checking r.out contains the inference probe lines (e.g., "Gateway inference:" and "Provider: openai-api" or "Model: gpt-4o-mini") and that normal gateway status appears (e.g., "Status: Connected"), using the same runWithEnv result and existing expect APIs so you verify probing remains enabled in non-docker openshellDriver cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/cli.test.ts`:
- Around line 1003-1052: The test "sandbox <name> status preserves Inference
probe when openshellDriver is not docker" currently only asserts the absence of
"Failure layer: docker_unreachable"; update the test (around runWithEnv("alpha
status") / writeSandboxRegistry) to also assert that inference probing and
normal status output are present by checking r.out contains the inference probe
lines (e.g., "Gateway inference:" and "Provider: openai-api" or "Model:
gpt-4o-mini") and that normal gateway status appears (e.g., "Status:
Connected"), using the same runWithEnv result and existing expect APIs so you
verify probing remains enabled in non-docker openshellDriver cases.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8b9c4ead-b564-4518-bbce-7e09c2c07300
📒 Files selected for processing (3)
src/lib/actions/sandbox/status.test.tssrc/lib/actions/sandbox/status.tstest/cli.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26551292928
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4388.docs.buildwithfern.com/nemoclaw |
Selective E2E Results — ✅ All requested jobs passedRun: 26552024714
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ct in status Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26676934408
|
…hboardPort Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26679093934
|
…layers and expose failureLayer to status --json Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26680162950
|
…itions Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26680613479
|
…inted layer Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26704812164
|
Selective E2E Results — ✅ All requested jobs passedRun: 26790614231
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/actions/sandbox/status-preflight.ts (1)
129-154: ⚡ Quick winMisplaced JSDoc: the "Print the exact first-line preflight header" comment is attached to
withoutTerminalPhasePreflight.The doc block at Lines 129-133 describes header printing, but it sits above
withoutTerminalPhasePreflight(which clears terminal-phase preflight state and prints nothing). The function it actually describes,printSandboxStatusPreflightHeaderat Line 147, is left undocumented.📝 Proposed fix to move the doc to the correct function
-/** - * Print the exact first-line preflight header. Unlike gateway-level fallback - * headers this intentionally has no leading indentation because users and - * tests rely on `docker_unreachable` being the first bytes of status output. - */ export function withoutTerminalPhasePreflight( preflight: SandboxStatusPreflightResult, phase: string | null, ): SandboxStatusPreflightResult { if (!phase || !isTerminalSandboxPhase(phase)) return preflight; return { failure: null, failureLayer: null, suppressInferenceProbe: preflight.suppressInferenceProbe, exitCode: 0, }; } +/** + * Print the exact first-line preflight header. Unlike gateway-level fallback + * headers this intentionally has no leading indentation because users and + * tests rely on `docker_unreachable` being the first bytes of status output. + */ export function printSandboxStatusPreflightHeader( preflight: SandboxStatusPreflightResult, writer: (message: string) => void = console.log, ): void {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/status-preflight.ts` around lines 129 - 154, The JSDoc describing "Print the exact first-line preflight header" is incorrectly placed above withoutTerminalPhasePreflight; move that doc block to directly above printSandboxStatusPreflightHeader and update its wording to describe header printing (mention getLayerHeader and writer callback). Leave withoutTerminalPhasePreflight either undocumented or add a short doc describing that it clears terminal-phase preflight state (mention isTerminalSandboxPhase and returned fields failure/failureLayer/suppressInferenceProbe/exitCode) so intent remains clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/actions/sandbox/status-preflight.ts`:
- Around line 129-154: The JSDoc describing "Print the exact first-line
preflight header" is incorrectly placed above withoutTerminalPhasePreflight;
move that doc block to directly above printSandboxStatusPreflightHeader and
update its wording to describe header printing (mention getLayerHeader and
writer callback). Leave withoutTerminalPhasePreflight either undocumented or add
a short doc describing that it clears terminal-phase preflight state (mention
isTerminalSandboxPhase and returned fields
failure/failureLayer/suppressInferenceProbe/exitCode) so intent remains clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77de6f99-3f42-4d19-8b08-e66c11e8be42
📒 Files selected for processing (8)
docs/reference/commands.mdxsrc/lib/actions/sandbox/docker-health.tssrc/lib/actions/sandbox/gateway-failure-classifier.tssrc/lib/actions/sandbox/status-preflight.tssrc/lib/actions/sandbox/status-snapshot.tssrc/lib/actions/sandbox/status.tstest/cli.test.tstest/gateway-failure-classifier.test.ts
💤 Files with no reviewable changes (2)
- test/gateway-failure-classifier.test.ts
- test/cli.test.ts
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.mdx
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lib/actions/sandbox/docker-health.ts
- src/lib/actions/sandbox/gateway-failure-classifier.ts
- src/lib/actions/sandbox/status.ts
Selective E2E Results — ❌ Some jobs failedRun: 26790735708
|
|
Addressed the updated PR Review Advisor finding about What changed:
Validated locally:
Pushed head: |
Selective E2E Results — ✅ All requested jobs passedRun: 26791270599
|
Selective E2E Results — ❌ Some jobs failedRun: 26791599457
|
…rk (#4640) ## Summary `TC-SBX-09: Tmux Session Flow` has been failing on every scheduled nightly E2E run since #4606 merged. The first assertion (tmux binary present) still passes; the second assertion — drive a full detached `new-session` → `list-sessions` → `kill-session` cycle inside the sandbox — consistently fails with `create window failed: fork failed: Permission denied`. Root cause: OpenShell sandbox hardening (seccomp, `no-new-privileges`, `nproc=512` ulimit) blocks tmux's fork-to-spawn child window when invoked under the e2e SSH session account. The binary-presence assertion already covers the surface of issue #4513; the lifecycle drive depends on sandbox runtime capabilities that are environment-dependent and out of scope for this case. Degrade that branch to `skip` with the observed `fork failed` output so the suite reports the limitation without failing the nightly. Latest failing scheduled nightly: [run 26790528855](https://github.com/NVIDIA/NemoClaw/actions/runs/26790528855). Same failure also blocks PR review on [#4388](#4388) via inherited advisor reruns [run 26790735708](https://github.com/NVIDIA/NemoClaw/actions/runs/26790735708) and [run 26791599457](https://github.com/NVIDIA/NemoClaw/actions/runs/26791599457). ## Related Issue Follow-up to #4606 (which added TC-SBX-09 alongside the sandbox-image tmux pin). The PR body of #4606 noted *"A full image-build + live-sandbox E2E was not run in this environment"* — the lifecycle drive added by that PR turned out to be incompatible with the live OpenShell sandbox seccomp + capability profile, so every scheduled `E2E / Nightly` run since the merge has reported `sandbox-operations-e2e` as failing on this single assertion. This PR keeps the binary-presence guard from #4606 intact (the actual surface of #4513) while making the lifecycle drive a soft skip when the sandbox refuses to fork, so the nightly pipeline can go green again without masking real regressions (a non-`fork failed` error still hits the `fail` branch). ## Changes - `test/e2e/test-sandbox-operations.sh`: in `test_sbx_09_tmux_session_flow`, add a `skip` branch matching `fork failed: (Permission denied|Resource temporarily unavailable)` between the existing `pass`/`fail` branches; keeps best-effort `kill-session` cleanup before recording the skip. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Tinson Lai <tinsonl@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Improved tmux sandbox operations test to better detect and handle fork failures with enhanced error recovery and clearer skip messages. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…r-unreachable-header
|
Merged latest New head: Push-time hooks passed, including TypeScript CLI and source-shape budget. CI/E2E advisor checks are pending on the new head. |
Selective E2E Results — ✅ All requested jobs passedRun: 26795813868
|
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
sandbox status(text and--json) on docker-driver sandboxes now classifies a stopped local stack before the inference probe runs, so the host-sideInferenceline never falsely reports healthy when the daemon is down, the per-sandbox container is stopped, or the dashboard port is held by a foreign listener. The classification surfaces as a verbatim failure-layer header in human output and a newfailureLayerfield in the JSON report, with exit code 1 in both paths.Related Issue
Fixes #4313
Fixes #4515
Changes
sandbox_container_stoppedandsandbox_dashboard_port_conflictlayers ingateway-failure-classifier.tsplusclassifySandboxContainerFailure(gated on integer dashboard port[1, 65535]).resolveSandboxContainerOwnerintosrc/lib/actions/sandbox/sandbox-container-owner.ts; reuse from bothgateway-failure-classifier.tsanddocker-health.tsso the longest-owner rule is the single source of truth.classifySandboxStatusPreflightFailureinstatus.ts; bothshowSandboxStatusandgetSandboxStatusReportrun it first, then passsuppressInferenceProbeintocollectSandboxStatusSnapshotvia the newmaybeGetSandboxStatusInferenceHealthgate so the remote provider probe is never issued when the classifier already reported a failure.failureLayeronSandboxStatusReport;src/commands/sandbox/status.tssetsprocess.exitCode = 1when it is non-null.docs/reference/commands.mdxto describe the new failure layers.src/lib/actions/sandbox/status.test.ts,test/gateway-failure-classifier.test.ts,test/sandbox-container-owner.test.ts, and four new--jsonintegration cases intest/cli.test.ts(including a realnet.createServerforeign listener for the dashboard-port-conflict path).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests