fix(sandbox): match re-execed plain-openclaw gateway argv in HEALTHCHECK (#4952) by jason-ma-nv · Pull Request #4958 · NVIDIA/NemoClaw

jason-ma-nv · 2026-06-08T12:20:14Z

Summary

The sandbox container's Docker HEALTHCHECK reported every NemoClaw sandbox as (unhealthy) even when the gateway was alive and serving. Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway into a process whose argv is plain `openclaw` with no `gateway` token, which the gateway-liveness fallback's `pgrep -f 'openclaw[ -]gateway'` could not match. This adds a `pgrep -x openclaw` fallback so the re-execed form is recognized.

Related Issue

Fixes #4952

Changes

- `Dockerfile`: the HEALTHCHECK gateway-liveness fallback now tries `pgrep --ignore-ancestors -f 'openclaw[ -]gateway'` first and falls back to `pgrep --ignore-ancestors -x openclaw`, matching the re-execed plain-`openclaw` argv by exact process name. This mirrors the established `gateway_pid()` helper in `test/e2e/test-issue-2478-crash-loop-recovery.sh` (gateway-token match first, bare-`openclaw` fallback second). The surrounding comment is updated to explain the new form.
- `test/sandbox-provisioning.test.ts`: new #4952 describe block that drives a `pgrep` mock which actually matches its `-f`/`-x` pattern against a simulated process table, so the probe outcome depends on the real argv shape rather than a forced exit code (which the existing `runProductionHealthProbe` harness cannot exercise). Covers the plain-`openclaw` argv, the launcher `openclaw gateway run` form, the legacy `openclaw-gateway` form, and the no-process case.
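To make the argv mismatch concrete, here is a minimal sketch that emulates the two `pgrep` modes against a simulated process table instead of the live one. The row format, the PIDs, and the helper names (`match_full_argv`, `match_exact_name`, `find_gateway_pid`) are assumptions made for illustration; they are not taken from the Dockerfile or the e2e helper.

```shell
#!/bin/sh
# Simulated process table rows in the form "<pid>|<comm>|<full argv>".
# PIDs and rows are hypothetical, chosen to mirror the argv shapes named above.
reexec_row='101|openclaw|openclaw'
launcher_row='102|openclaw|openclaw gateway run'
legacy_row='103|openclaw-gatewa|openclaw-gateway'  # Linux caps comm at 15 characters

# Emulates `pgrep -f PATTERN`: regex search over the full argv column.
match_full_argv() {
  printf '%s\n' "$2" | awk -F'|' -v re="$1" '$3 ~ re { print $1; found = 1 } END { exit !found }'
}

# Emulates `pgrep -x NAME`: exact comparison against the process name column.
match_exact_name() {
  printf '%s\n' "$2" | awk -F'|' -v name="$1" '$2 == name { print $1; found = 1 } END { exit !found }'
}

# Two-step order from the description: gateway-token match first, bare name second.
find_gateway_pid() {
  match_full_argv 'openclaw[ -]gateway' "$1" || match_exact_name openclaw "$1"
}

for row in "$reexec_row" "$launcher_row" "$legacy_row"; do
  echo "argv '${row##*|}' -> pid $(find_gateway_pid "$row")"
done
```

In this sketch the token pattern alone finds the launcher and legacy rows but not the re-execed row, which is only found through the exact-name step.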

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: jama@nvidia.com <jama@nvidia.com>

Summary by CodeRabbit

Bug Fixes
- More reliable container healthchecks to avoid false "unhealthy" states by recognizing re-execed gateway variants and validating recorded gateway PID.
New Features
- HEALTHCHECK now falls back to a persisted gateway PID when process argv patterns are absent.
Documentation
- Expanded HEALTHCHECK docs and rationale covering newer gateway re-exec behavior.
Tests
- Added regression and unit tests covering re-exec, PID-recording, and various healthy/unhealthy scenarios.

…ECK (#4952)

Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway into a process whose argv is plain `openclaw` with no `gateway` token. On runtime shapes where the in-container curl probe fails (connection refused, exit 7) and the /tmp/nemoclaw-gateway-local marker is present, the gateway-liveness fallback used `pgrep -f 'openclaw[ -]gateway'`, which cannot see the re-execed process, so the container was reported permanently unhealthy even though the gateway was alive and serving.

Add a `pgrep -x openclaw` fallback that matches the re-execed form by exact process name, mirroring the gateway_pid() helper in test/e2e/test-issue-2478-crash-loop-recovery.sh (gateway-token match first, bare-openclaw fallback second).

Cover the regression in test/sandbox-provisioning.test.ts with a pgrep mock that matches its -f/-x pattern against a simulated process table, so the probe outcome depends on the real argv shape rather than a forced exit code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>

coderabbitai · 2026-06-08T12:20:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45bc76ba-3150-400c-ba89-acaaeac77334

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4122c and 998c642.

📒 Files selected for processing (1)

test/nemoclaw-start.test.ts

📝 Walkthrough

Start script now records gateway PID to /tmp/nemoclaw-gateway.pid; Dockerfile HEALTHCHECK first tries pgrep for gateway argv and falls back to verifying the recorded PID's command name when curl reports connection refused. Tests validate PID recording and the HEALTHCHECK probe across argv variants and failure modes.

Changes

Sandbox Healthcheck Process Detection Fix

| Layer / File(s) | Summary |
| --- | --- |
| Healthcheck comment and fallback detection logic (`Dockerfile`) | HEALTHCHECK docs updated; when curl returns connection refused the probe first tries `pgrep -f 'openclaw[ -]gateway'` and, if absent, reads `/tmp/nemoclaw-gateway.pid` and verifies that PID's `ps -o comm=` matches `openclaw*`; otherwise container is unhealthy. |
| record_gateway_pid helper and invocation points (`scripts/nemoclaw-start.sh`) | Adds `record_gateway_pid()` (best-effort write to `/tmp/nemoclaw-gateway.pid`) and invokes it immediately after gateway launch and after each respawn in both non-root and root modes. |
| Tests for record_gateway_pid behaviour (`test/gateway-pid-recording.test.ts`) | Extracts the shell helper from the start script, tests that it writes a provided PID to a temp file, and verifies it exits successfully when the target path is unwritable (best-effort semantics). |
| HEALTHCHECK regression tests, argv variants and failure modes (`test/sandbox-provisioning.test.ts`) | Derives the HEALTHCHECK command from the Dockerfile and runs it with mocked `curl`, `pgrep`, and `ps` over simulated process tables and optional recorded gateway PID. Asserts healthy for plain re-execed `openclaw` when recorded PID resolves to `openclaw`, launcher-form `openclaw gateway run`, and legacy `openclaw-gateway`, and asserts unhealthy for missing/invalid recorded PID and PID reuse cases. |
| Test harness stub update (`test/nemoclaw-start.test.ts`) | Stubs `record_gateway_pid()` (no-op) alongside `cleanup_on_signal()` in the gateway launch test harness so generated scripts reference the symbol during tests. |
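The fallback order summarized in the table could be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the Dockerfile's actual HEALTHCHECK text: the function name `gateway_alive` and its optional PID-file argument are hypothetical, and the surrounding curl probe and marker-file checks are left out.

```shell
#!/bin/sh
# Sketch of the liveness fallback only; the real HEALTHCHECK also runs a curl
# probe and checks marker files before reaching this point.
gateway_alive() {
  local pid_file="${1:-/tmp/nemoclaw-gateway.pid}"
  local pid comm

  # Step 1: launcher and legacy forms still carry a gateway token in argv.
  # --ignore-ancestors keeps the probing shell itself from matching.
  if pgrep --ignore-ancestors -f 'openclaw[ -]gateway' >/dev/null 2>&1; then
    return 0
  fi

  # Step 2: the re-execed form has a plain `openclaw` argv, so rely on the
  # PID recorded by the start script.
  [ -r "$pid_file" ] || return 1
  pid=$(cat "$pid_file" 2>/dev/null)
  case "$pid" in
    '' | *[!0-9]*) return 1 ;;  # empty or non-numeric content
  esac

  # Guard against PID reuse: the recorded PID must still be an openclaw process.
  comm=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
  case "$comm" in
    openclaw*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Checking the command name of the recorded PID, rather than only that the PID exists, is what lets the probe report unhealthy when the number has been recycled by an unrelated process.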

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

[All Platforms][CLI&UX] nemoclaw tunnel stop reports gateway not running and exits 0 while in-sandbox openclaw gateway keeps running #4951: Addresses the same argv-rewrite detection gap; implements PID-file-based detection similar to this PR.

Possibly related PRs

NVIDIA/NemoClaw#4748: Related HEALTHCHECK/startup gating changes touching sandbox health detection.

Suggested labels

bug-fix, platform: container, Docker, Sandbox, v0.0.60

Suggested reviewers

cv
prekshivyas
jyaunches

Poem

🐰 I hopped through scripts and Dockerfile lines,
I left a PID trail for the healthcheck to find,
When argv hides the gateway as just "openclaw",
My small note shows which PID it saw,
Now containers can say "I'm healthy!" — hop hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: updating the HEALTHCHECK to handle re-execed gateway processes with plain 'openclaw' argv. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses issue `#4952` by fixing both the pgrep fallback pattern and implementing a recorded PID verification mechanism with proper test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the HEALTHCHECK and gateway PID recording, with no extraneous modifications. |

github-actions · 2026-06-08T12:23:04Z

E2E Advisor Recommendation

Required E2E: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e, test-e2e-gateway-isolation
Optional E2E: test-non-root-sandbox-smoke, sandbox-operations-e2e, test-e2e-port-overrides

Dispatch hint: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

issue-2478-crash-loop-recovery-e2e (high): Closest existing live regression for gateway PID/argv/respawn behavior. It onboards a real sandbox, observes current OpenClaw gateway process shapes including plain openclaw, kills and recovers the gateway repeatedly, and verifies inference remains available.
sandbox-survival-e2e (high): Validates the real user-facing sandbox lifecycle after gateway restarts: onboard, sandbox discoverability, SSH/connect/status, persisted workspace/state, and live inference after gateway stop/start.
test-e2e-gateway-isolation (medium): Builds the production image and runs image-level gateway/entrypoint hardening checks. This is required because both the Dockerfile and nemoclaw-start production entrypoint changed.

Optional E2E

test-non-root-sandbox-smoke (low): Useful adjacent coverage for the non-root entrypoint path under no-new-privileges. The PR touches the non-root gateway launch path, although this smoke does not launch a long-running gateway.
sandbox-operations-e2e (high): Broader sandbox lifecycle confidence, especially TC-SBX-08 process recovery and status/connect behavior after gateway process disruption.
test-e2e-port-overrides (low): Optional image-level confidence for dashboard/gateway port handling near the changed HEALTHCHECK block.

New E2E recommendations

docker-image-healthcheck (high): Existing live E2Es do not appear to force the exact Docker HEALTHCHECK fallback path: marker present, in-container curl exit 7, current OpenClaw re-exec argv as plain openclaw, and liveness proven only through /tmp/nemoclaw-gateway.pid.
- Suggested test: Add a production-image healthcheck E2E that starts the image with a fake or controlled OpenClaw gateway re-execing to plain openclaw, forces the healthcheck curl path to fail with connection-refused, asserts Docker health stays healthy while the recorded PID is live, and asserts it becomes unhealthy when the recorded PID dies or is reused by a non-openclaw process.
sandbox-gateway-lifecycle (medium): The new PID file is refreshed on every respawn, but existing live crash-loop coverage does not explicitly assert /tmp/nemoclaw-gateway.pid matches the current in-container gateway PID after each respawn.
- Suggested test: Extend the crash-loop recovery E2E to read /tmp/nemoclaw-gateway.pid after initial launch and after each forced respawn, verifying it points to the live gateway process and changes when the gateway respawns.
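A real-container assertion of the kind suggested above would need to read Docker's own health state. The sketch below polls `docker inspect` for `.State.Health.Status`; the function name `wait_for_health`, the container name in the usage comment, and the attempt and delay defaults are illustrative assumptions, not part of the existing E2E suites.

```shell
#!/bin/sh
# Polls Docker until the container reports the wanted health status or the
# attempt budget is exhausted. Returns 0 on match, 1 on timeout.
wait_for_health() {
  local container="$1" want="$2" attempts="${3:-30}" delay="${4:-2}"
  local i=0 health
  while [ "$i" -lt "$attempts" ]; do
    # A failed inspect (for example, container not found) counts as "unknown".
    health=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || health=unknown
    if [ "$health" = "$want" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage (hypothetical container name):
#   wait_for_health my-sandbox healthy 30 2 || echo "sandbox never became healthy"
```

Because Docker only updates the status after each HEALTHCHECK interval, a polling loop with a bounded budget tends to be more robust than a single inspect call.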

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e

github-actions · 2026-06-08T12:23:05Z

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

ubuntu-repo-cloud-openclaw: Docker image HEALTHCHECK logic and scripts/nemoclaw-start.sh gateway PID recording affect repo-current Docker sandbox startup and gateway health. The canonical Ubuntu repo cloud OpenClaw scenario exercises the PR-built image, entrypoint, gateway startup, sandbox running state, and smoke gateway-health path.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

wsl-repo-cloud-openclaw: Optional adjacent coverage for the same repo-current Docker/OpenClaw startup and gateway-health surface on WSL, a special-runner/runtime shape where namespace and Docker behavior can differ.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw
gpu-repo-local-ollama-openclaw: Optional adjacent coverage for repo-current Docker startup and gateway health on the GPU/CDI runner. This is a special-runner scenario, so it is optional unless GPU-specific regressions are suspected.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Relevant changed files

Dockerfile
scripts/nemoclaw-start.sh

github-actions · 2026-06-08T12:23:37Z

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Add runtime validation for the Docker HEALTHCHECK state (Dockerfile:963): The changed shell-snippet and helper tests cover the HEALTHCHECK branches well, including the plain `openclaw` argv and negative PID cases, but issue #4952's observable acceptance condition is Docker reporting the built sandbox container as healthy. This path depends on Docker HEALTHCHECK scheduling, the real process table, entrypoint PID recording, and gateway respawn behavior, none of which are exercised by the changed tests.
- Recommendation: Add or identify a built-container validation that checks Docker `Health.Status=healthy` when in-container curl gets connection refused but `/tmp/nemoclaw-gateway.pid` points to the live re-execed plain-`openclaw` gateway, and `Health.Status=unhealthy` when the gateway PID is dead while an unrelated `openclaw agent ...` process and stale/non-empty `/tmp/gateway.log` exist.
- Evidence: Issue #4952 expects `docker ps` to show `Up N minutes (healthy)` and `docker inspect` to return `healthy`. The PR adds Dockerfile PID fallback logic plus shell-snippet tests in `test/sandbox-provisioning.test.ts` and producer tests in `test/gateway-pid-recording.test.ts`, but no changed file validates actual Docker health state of a built container.

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation:** Built sandbox container reports Docker `Health.Status=healthy` when in-container curl returns connection refused, `/tmp/nemoclaw-gateway-local` exists, and `/tmp/nemoclaw-gateway.pid` points to a live plain-`openclaw` gateway.
**Runtime validation:** Built sandbox container reports Docker `Health.Status=unhealthy` when `/tmp/gateway.log` is stale/non-empty, the recorded gateway PID is dead, and an unrelated `openclaw agent ...` process exists.
**Runtime validation:** Respawned gateway updates `/tmp/nemoclaw-gateway.pid` and Docker health recovers after the new gateway process is launched.
**Runtime validation:** If stronger PID identity is added later, a stale recorded PID reused by another plain-`openclaw` non-gateway process does not falsely keep Docker healthy.
Rationale for all four runtime validations: the branch-level shell tests are strong, but this is sandbox lifecycle behavior whose observable outcome depends on Docker HEALTHCHECK scheduling, actual process-table behavior, entrypoint PID recording, and respawn timing.
**Add runtime validation for the Docker HEALTHCHECK state:** Add or identify a built-container validation that checks Docker `Health.Status=healthy` when in-container curl gets connection refused but `/tmp/nemoclaw-gateway.pid` points to the live re-execed plain-`openclaw` gateway, and `Health.Status=unhealthy` when the gateway PID is dead while an unrelated `openclaw agent ...` process and stale/non-empty `/tmp/gateway.log` exist.
**Acceptance clause:** Issue #4952: "The sandbox container HEALTHCHECK defined in `Dockerfile:943-954` always returns exit 1 (unhealthy), even when the gateway is fully functional." Add test evidence or identify existing coverage. The Dockerfile HEALTHCHECK now falls back to `/tmp/nemoclaw-gateway.pid` when curl returns rc=7 and the local-gateway marker exists, and `test/sandbox-provisioning.test.ts` expects status 0 for the plain-`openclaw` gateway case. The actual built-container Docker health status is not validated in changed files.
**Acceptance clause:** Issue #4952: "Impact: container always shows `(unhealthy)` in `docker ps`. Any monitoring/alerting based on docker health (Prometheus exporter, dashboard, ops alerts) reports false-positive unhealthy state for fully-functional sandboxes." Add test evidence or identify existing coverage. The shell command now returns healthy for the simulated false-unhealthy branch, but there is no changed-file evidence that `docker ps` or a monitor observing Docker health now sees the built container as healthy.
**Acceptance clause:** Issue #4952: "Expected Result — Step 3: `docker ps` shows `Up N minutes (healthy)`, `Health.Status` returns `healthy`." Add test evidence or identify existing coverage. The Dockerfile command returns status 0 in shell-snippet tests for the plain-argv gateway case, but the PR does not add or identify a built-container Docker `Health.Status=healthy` validation.

This is an automated advisory review. A human maintainer must make the final merge decision.

…ay PID (#4952)

Tighten the re-execed-gateway fallback added for #4952. A bare `pgrep -x openclaw` only proves *some* process named `openclaw` exists, so a marker-present container with a stale non-empty /tmp/gateway.log plus an unrelated `openclaw` one-shot (e.g. `openclaw agent ...`) could keep Docker healthy after the real gateway died, weakening restart/self-healing.

nemoclaw-start now records the live gateway PID in /tmp/nemoclaw-gateway.pid (record_gateway_pid, written on both the root and non-root launch paths and refreshed on every respawn). When the `openclaw[ -]gateway` pgrep pattern misses, the HEALTHCHECK confirms THAT recorded PID is still a live `openclaw` process via `ps -p <pid> -o comm=`, with a comm-prefix guard against PID reuse. This is gateway-specific and no longer fooled by a non-gateway `openclaw` process.

Tests:
- test/sandbox-provisioning.test.ts: drive the fallback with a recorded PID file + ps mock; add the reviewer's case (stray non-gateway `openclaw` + dead recorded PID -> unhealthy) and a PID-reuse case.
- test/gateway-pid-recording.test.ts (new, focused file): record_gateway_pid writes the PID file the HEALTHCHECK reads and never fails startup on a write error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>

jason-ma-nv · 2026-06-08T13:17:39Z

Pushed a follow-up addressing the review feedback.

PR Review Advisor — "Tighten bare openclaw fallback" (the self-healing gap): fixed. The bare pgrep -x openclaw fallback is replaced with a gateway-specific check: nemoclaw-start now records the live gateway PID in /tmp/nemoclaw-gateway.pid (record_gateway_pid, written on both the root and non-root launch paths and refreshed on every respawn), and when the openclaw[ -]gateway pgrep pattern misses, the HEALTHCHECK confirms that recorded PID is still a live openclaw process via ps -p <pid> -o comm= (with a comm-prefix guard against PID reuse). A stale /tmp/gateway.log plus an unrelated openclaw one-shot can no longer keep the container green after the real gateway dies — restart/self-healing is preserved.

New/updated tests:

test/sandbox-provisioning.test.ts: the fallback is now exercised with a recorded PID file + ps mock. Added the exact case you flagged — stray non-gateway openclaw present + recorded gateway PID dead → unhealthy — plus a PID-reuse case.
test/gateway-pid-recording.test.ts (new focused file): proves record_gateway_pid writes the file the HEALTHCHECK reads and never aborts startup on a write error.
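The best-effort behavior described for `record_gateway_pid` (write the PID, never abort startup on a write error) might look something like the sketch below. It is a hypothetical re-creation, not the helper from `scripts/nemoclaw-start.sh`; in particular the optional second argument is an assumption added so the sketch can be pointed at a temp file.

```shell
#!/bin/sh
# Hypothetical best-effort PID recorder. The default path matches the file the
# HEALTHCHECK is described as reading; the override argument is illustrative.
record_gateway_pid() {
  local pid="$1"
  local pid_file="${2:-/tmp/nemoclaw-gateway.pid}"
  # Swallow both the error message and the failure status so that a missing
  # or read-only target cannot abort startup, even under `set -e`.
  { printf '%s\n' "$pid" > "$pid_file"; } 2>/dev/null || true
}

# Usage after launching a background gateway (illustrative):
#   some_gateway_command &
#   record_gateway_pid "$!"
```

The trade-off of swallowing the error is that a failed write silently leaves the health probe without a PID to verify, so the probe must treat a missing file as unhealthy rather than as unknown.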

Linked Issues check (curl probe): the curl exit-7 path is handled by design — it falls back rather than failing — so the "always unhealthy" symptom was driven by the liveness fallback, which is what this PR corrects. Binding the gateway to loopback / probing its actual listener address is a separate concern best tracked on its own; happy to open a follow-up if maintainers prefer the broader change here.

E2E advisors: the suggested real-container assertion of .State.Health.Status under the #4952 shape is a good addition but lives outside this unit/shell-snippet scope — noting it for issue-2478-crash-loop-recovery-e2e / a dedicated production-HEALTHCHECK e2e.

coderabbitai

🧹 Nitpick comments (2)

Dockerfile (1)
964-979: Run the container-level E2E health assertions for this HEALTHCHECK change.

Given this is Docker HEALTHCHECK behavior, please validate with the recommended E2E jobs to confirm .State.Health.Status behavior in a real container runtime.

As per coding guidelines, Dockerfile-layer behavior is only fully testable with real container builds and should be validated via the listed nightly E2E jobs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile` around lines 964 - 979, Run container-level E2E validation for
the new HEALTHCHECK snippet: build and run the image and exercise the
HEALTHCHECK path that reads
NEMOCLAW_DASHBOARD_PORT/OPENCLAW_GATEWAY_PORT/CHAT_UI_URL, probes
http://127.0.0.1:${port}/health, and falls back to the local gateway checks
(/tmp/nemoclaw-gateway-local, /tmp/nemoclaw-gateway.pid, process matching
'openclaw gateway', and /tmp/gateway.log); use the recommended nightly E2E jobs
to confirm the container .State.Health.Status transitions for success, curl
connection refused (rc=7) and other failures, and document any runtime
differences from the Dockerfile logic.
Source: Coding guidelines
scripts/nemoclaw-start.sh (1)
3185-3594: Run the entrypoint-focused E2E suite for restart and recovery semantics.

This PID-recording path affects every sandbox boot and respawn loop; please validate with the recommended sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e jobs.

As per coding guidelines, scripts/nemoclaw-start.sh changes are not fully covered by unit tests and should be verified via the specified nightly E2E workflows.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 3185 - 3594, Run the
entrypoint-focused E2E suites (sandbox-survival-e2e, sandbox-operations-e2e,
cloud-e2e, openclaw-slack-pairing-e2e) to validate the new non-root and root
PID/respawn behavior: exercise gateway start/stop/crash scenarios and confirm
record_gateway_pid writes the expected PID, SANDBOX_CHILD_PIDS and
SANDBOX_WAIT_PID are populated correctly after launches and respawns, the
respawn sliding-window logic (RESPAWN_TIMES/RESPAWN_COUNT) enforces the intended
throttling/alerts, the persistent log mirror started by
start_persistent_gateway_log_mirror captures /tmp/gateway.log into the durable
sandbox log, and validate_tmp_permissions/path ownership fixes
(fix_openclaw_ownership, provision_agent_workspaces,
seed_default_workspace_templates_as_sandbox) preserve permissions so restarts
and auto-pairing succeed; file any failures or flakes as regressions against
these functions for follow-up.
Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9807d797-5212-49f4-8643-dd3a7b8c6702

📥 Commits

Reviewing files that changed from the base of the PR and between 2684c22 and 4a4122c.

📒 Files selected for processing (4)

Dockerfile
scripts/nemoclaw-start.sh
test/gateway-pid-recording.test.ts
test/sandbox-provisioning.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

test/sandbox-provisioning.test.ts

…4952)

The runLaunchBlock harness in nemoclaw-start.test.ts extracts the gateway launch block and runs it with its helper calls stubbed. The new record_gateway_pid call added for #4952 was not stubbed, so the extracted block hit an undefined command (exit 127) and failed the two launch-block tests in CI.

Stub it alongside cleanup_on_signal (kept on one line to stay within the test-file-size budget) so the extracted block does not write the host /tmp during the test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>
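The exit-127 failure mode is easy to reproduce in isolation. The sketch below runs a two-line stand-in for the extracted launch block, first without and then with a no-op stub; `launch_block` and `run_block` are hypothetical names, and the real harness is TypeScript that generates a much longer script.

```shell
#!/bin/sh
# Hypothetical two-line stand-in for the extracted gateway launch block.
launch_block='record_gateway_pid "$$"
echo launched'

# Runs the block in a fresh shell with optional stub definitions prepended.
run_block() {
  sh -c "set -e
$1
$launch_block" 2>/dev/null
}

# Without a stub the helper is an unknown command, so the shell exits 127.
status=0
run_block '' || status=$?
echo "without stub: exit $status"

# With a no-op stub the block runs to completion and writes nothing to /tmp.
status=0
run_block 'record_gateway_pid() { :; }' || status=$?
echo "with stub: exit $status"
```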

wscurran · 2026-06-08T14:44:09Z

✨
Related open issues:

#4952 [All Platforms][Sandbox] sandbox container HEALTHCHECK always unhealthy — gateway not on container loopback and pgrep fallback uses wrong argv pattern

jason-ma-nv self-assigned this Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

wscurran added labels on Jun 8, 2026: area: integrations (Third-party service integration behavior); area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery); bug-fix (PR fixes a bug or regression); integration: openclaw (OpenClaw integration behavior)
