fix(sandbox): match re-execed plain-openclaw gateway argv in HEALTHCHECK (#4952)#4958

Open
jason-ma-nv wants to merge 3 commits into main from fix/4952-healthcheck-pgrep-plain-openclaw

jason-ma-nv (Contributor) commented Jun 8, 2026

Summary

The sandbox container Docker HEALTHCHECK reported every NemoClaw sandbox as (unhealthy) even when the gateway was alive and serving. Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway into a process whose argv is plain openclaw with no gateway token, which the gateway-liveness fallback's pgrep -f 'openclaw[ -]gateway' could not match. This adds a pgrep -x openclaw fallback so the re-execed form is recognized.

Related Issue

Fixes #4952

Changes

  • Dockerfile: the HEALTHCHECK gateway-liveness fallback now tries pgrep --ignore-ancestors -f 'openclaw[ -]gateway' first and falls back to pgrep --ignore-ancestors -x openclaw, matching the re-execed plain-openclaw argv by exact process name. This mirrors the established gateway_pid() helper in test/e2e/test-issue-2478-crash-loop-recovery.sh (gateway-token match first, bare-openclaw fallback second). The surrounding comment is updated to explain the new form.
  • test/sandbox-provisioning.test.ts: new #4952 describe block that drives a pgrep mock which actually matches its -f/-x pattern against a simulated process table — so the probe outcome depends on the real argv shape rather than a forced exit code (which the existing runProductionHealthProbe harness cannot exercise). Covers the plain-openclaw argv, the launcher openclaw gateway run form, the legacy openclaw-gateway form, and the no-process case.
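
To illustrate why the original pattern misses the re-execed process, here is a minimal sketch that mimics `pgrep -f` (regex over the full command line) and `pgrep -x` (exact process-name match) against a simulated process table. The `mock_pgrep` helper and the `<pid> <comm> <argv...>` table format are illustrative assumptions; the PR's actual mock lives in the TypeScript test file and may be structured differently.

```shell
#!/bin/sh
# Hypothetical helper: mimic pgrep over a simulated process table.
# Each table line is "<pid> <comm> <full argv...>".
mock_pgrep() {
  mode=$1; pattern=$2; table=$3
  matches=$(printf '%s\n' "$table" | while read -r pid comm args; do
    case $mode in
      # -f: extended-regex match against the full command line
      -f) printf '%s\n' "$args" | grep -Eq -- "$pattern" && echo "$pid" ;;
      # -x: exact match against the process name
      -x) [ "$comm" = "$pattern" ] && echo "$pid" ;;
    esac
  done)
  # Like pgrep, print matching PIDs and return non-zero when nothing matched.
  [ -n "$matches" ] && printf '%s\n' "$matches"
}

# Simulated table: legacy form, launcher form, re-execed plain form.
table='101 openclaw-gateway openclaw-gateway
102 openclaw openclaw gateway run
103 openclaw openclaw'

mock_pgrep -f 'openclaw[ -]gateway' "$table"   # finds 101 and 102, not 103
mock_pgrep -x openclaw "$table"                # finds 102 and 103
```

Under these assumptions, PID 103 (the re-execed form) is invisible to the `-f` pattern and is only picked up by the exact-name match.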

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: jama@nvidia.com jama@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • More reliable container healthchecks to avoid false "unhealthy" states by recognizing re-execed gateway variants and validating recorded gateway PID.
  • New Features

    • HEALTHCHECK now falls back to a persisted gateway PID when process argv patterns are absent.
  • Documentation

    • Expanded HEALTHCHECK docs and rationale covering newer gateway re-exec behavior.
  • Tests

    • Added regression and unit tests covering re-exec, PID-recording, and various healthy/unhealthy scenarios.

…ECK (#4952)

Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway
into a process whose argv is plain `openclaw` with no `gateway` token. On
runtime shapes where the in-container curl probe fails (connection refused,
exit 7) and the /tmp/nemoclaw-gateway-local marker is present, the
gateway-liveness fallback used `pgrep -f 'openclaw[ -]gateway'`, which cannot
see the re-execed process — so the container was reported permanently
unhealthy even though the gateway was alive and serving.

Add a `pgrep -x openclaw` fallback that matches the re-execed form by exact
process name, mirroring the gateway_pid() helper in
test/e2e/test-issue-2478-crash-loop-recovery.sh (gateway-token match first,
bare-openclaw fallback second).

Cover the regression in test/sandbox-provisioning.test.ts with a pgrep mock
that matches its -f/-x pattern against a simulated process table, so the
probe outcome depends on the real argv shape rather than a forced exit code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>
@jason-ma-nv jason-ma-nv self-assigned this Jun 8, 2026
coderabbitai Bot (Contributor) commented Jun 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45bc76ba-3150-400c-ba89-acaaeac77334

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4122c and 998c642.

📒 Files selected for processing (1)
  • test/nemoclaw-start.test.ts

Walkthrough

Start script now records gateway PID to /tmp/nemoclaw-gateway.pid; Dockerfile HEALTHCHECK first tries pgrep for gateway argv and falls back to verifying the recorded PID's command name when curl reports connection refused. Tests validate PID recording and the HEALTHCHECK probe across argv variants and failure modes.
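
The probe order in the walkthrough can be sketched as a pair of shell functions. This is a minimal sketch, not the Dockerfile's actual HEALTHCHECK: the file paths, the `openclaw[ -]gateway` pattern, and the `ps -o comm=` check come from the PR text, while the function names `gateway_alive` and `health_probe` and their argument layout are assumptions made so the logic can be exercised in isolation.

```shell
#!/bin/sh
# Sketch of the fallback order; names and arguments are illustrative.
gateway_alive() {
  pid_file=${1:-/tmp/nemoclaw-gateway.pid}
  # 1. Launcher or legacy argv still carries a gateway token.
  pgrep -f 'openclaw[ -]gateway' >/dev/null 2>&1 && return 0
  # 2. Otherwise trust only the PID that the start script recorded.
  [ -r "$pid_file" ] || return 1
  pid=$(cat "$pid_file")
  case $pid in ''|*[!0-9]*) return 1 ;; esac   # reject empty or non-numeric
  comm=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
  # Guard against PID reuse: the live process must still be named openclaw*.
  case $comm in openclaw*) return 0 ;; *) return 1 ;; esac
}

health_probe() {
  curl_rc=$1; marker=${2:-/tmp/nemoclaw-gateway-local}; pid_file=$3
  [ "$curl_rc" -eq 0 ] && return 0
  # Only curl exit 7 (connection refused) with the marker present falls back.
  [ "$curl_rc" -eq 7 ] && [ -e "$marker" ] && gateway_alive "$pid_file"
}
```

Passing the curl exit code in as an argument keeps the sketch free of network access; a real probe would run curl itself and branch on `$?`.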

Changes

Sandbox Healthcheck Process Detection Fix

| Layer / File(s) | Summary |
|---|---|
| Healthcheck comment and fallback detection logic (Dockerfile) | HEALTHCHECK docs updated; when curl returns connection refused the probe first tries pgrep -f 'openclaw[ -]gateway' and, if absent, reads /tmp/nemoclaw-gateway.pid and verifies that PID's ps -o comm= matches openclaw*; otherwise container is unhealthy. |
| record_gateway_pid helper and invocation points (scripts/nemoclaw-start.sh) | Adds record_gateway_pid() (best-effort write to /tmp/nemoclaw-gateway.pid) and invokes it immediately after gateway launch and after each respawn in both non-root and root modes. |
| Tests for record_gateway_pid behaviour (test/gateway-pid-recording.test.ts) | Extracts the shell helper from the start script, tests that it writes a provided PID to a temp file, and verifies it exits successfully when the target path is unwritable (best-effort semantics). |
| HEALTHCHECK regression tests, argv variants and failure modes (test/sandbox-provisioning.test.ts) | Derives the HEALTHCHECK command from the Dockerfile and runs it with mocked curl, pgrep, and ps over simulated process tables and optional recorded gateway PID. Asserts healthy for plain re-execed openclaw when recorded PID resolves to openclaw, launcher-form openclaw gateway run, and legacy openclaw-gateway, and asserts unhealthy for missing/invalid recorded PID and PID reuse cases. |
| Test harness stub update (test/nemoclaw-start.test.ts) | Stubs record_gateway_pid() (no-op) alongside cleanup_on_signal() in the gateway launch test harness so generated scripts reference the symbol during tests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#4748: Related HEALTHCHECK/startup gating changes touching sandbox health detection.

Suggested labels

bug-fix, platform: container, Docker, Sandbox, v0.0.60

Suggested reviewers

  • cv
  • prekshivyas
  • jyaunches

Poem

🐰 I hopped through scripts and Dockerfile lines,
I left a PID trail for the healthcheck to find,
When argv hides the gateway as just "openclaw",
My small note shows which PID it saw,
Now containers can say "I'm healthy!" — hop hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: updating the HEALTHCHECK to handle re-execed gateway processes with plain 'openclaw' argv. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses issue #4952 by fixing both the pgrep fallback pattern and implementing a recorded PID verification mechanism with proper test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the HEALTHCHECK and gateway PID recording; no extraneous modifications. |


github-actions Bot (Contributor) commented Jun 8, 2026

E2E Advisor Recommendation

Required E2E: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e, test-e2e-gateway-isolation
Optional E2E: test-non-root-sandbox-smoke, sandbox-operations-e2e, test-e2e-port-overrides

Dispatch hint: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • issue-2478-crash-loop-recovery-e2e (high): Closest existing live regression for gateway PID/argv/respawn behavior. It onboards a real sandbox, observes current OpenClaw gateway process shapes including plain openclaw, kills and recovers the gateway repeatedly, and verifies inference remains available.
  • sandbox-survival-e2e (high): Validates the real user-facing sandbox lifecycle after gateway restarts: onboard, sandbox discoverability, SSH/connect/status, persisted workspace/state, and live inference after gateway stop/start.
  • test-e2e-gateway-isolation (medium): Builds the production image and runs image-level gateway/entrypoint hardening checks. This is required because both the Dockerfile and nemoclaw-start production entrypoint changed.

Optional E2E

  • test-non-root-sandbox-smoke (low): Useful adjacent coverage for the non-root entrypoint path under no-new-privileges. The PR touches the non-root gateway launch path, although this smoke does not launch a long-running gateway.
  • sandbox-operations-e2e (high): Broader sandbox lifecycle confidence, especially TC-SBX-08 process recovery and status/connect behavior after gateway process disruption.
  • test-e2e-port-overrides (low): Optional image-level confidence for dashboard/gateway port handling near the changed HEALTHCHECK block.

New E2E recommendations

  • docker-image-healthcheck (high): Existing live E2Es do not appear to force the exact Docker HEALTHCHECK fallback path: marker present, in-container curl exit 7, current OpenClaw re-exec argv as plain openclaw, and liveness proven only through /tmp/nemoclaw-gateway.pid.
    • Suggested test: Add a production-image healthcheck E2E that starts the image with a fake or controlled OpenClaw gateway re-execing to plain openclaw, forces the healthcheck curl path to fail with connection-refused, asserts Docker health stays healthy while the recorded PID is live, and asserts it becomes unhealthy when the recorded PID dies or is reused by a non-openclaw process.
  • sandbox-gateway-lifecycle (medium): The new PID file is refreshed on every respawn, but existing live crash-loop coverage does not explicitly assert /tmp/nemoclaw-gateway.pid matches the current in-container gateway PID after each respawn.
    • Suggested test: Extend the crash-loop recovery E2E to read /tmp/nemoclaw-gateway.pid after initial launch and after each forced respawn, verifying it points to the live gateway process and changes when the gateway respawns.
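
A container-level assertion of the kind suggested above could poll Docker's recorded health state. The following is a minimal sketch under stated assumptions: `wait_for_health` and its arguments are hypothetical names, not part of the PR or of any existing E2E job; the only external call is `docker inspect` with a Go-template format string reading `.State.Health.Status`.

```shell
#!/bin/sh
# Poll a container's Docker health state until it matches the wanted value.
# wait_for_health and its parameters are illustrative, not part of the PR.
wait_for_health() {
  container=$1; want=$2; tries=${3:-30}; delay=${4:-2}
  status=unknown
  while [ "$tries" -gt 0 ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) \
      || status=unknown
    [ "$status" = "$want" ] && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "container $container health is '$status', wanted '$want'" >&2
  return 1
}
```

An E2E job might call `wait_for_health <container> healthy` after forcing the connection-refused path, then kill the recorded gateway PID and call `wait_for_health <container> unhealthy`. The number of attempts needed depends on the HEALTHCHECK interval and retry settings, which are not stated in this conversation.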

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e

github-actions Bot (Contributor) commented Jun 8, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Docker image HEALTHCHECK logic and scripts/nemoclaw-start.sh gateway PID recording affect repo-current Docker sandbox startup and gateway health. The canonical Ubuntu repo cloud OpenClaw scenario exercises the PR-built image, entrypoint, gateway startup, sandbox running state, and smoke gateway-health path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • wsl-repo-cloud-openclaw: Optional adjacent coverage for the same repo-current Docker/OpenClaw startup and gateway-health surface on WSL, a special-runner/runtime shape where namespace and Docker behavior can differ.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw
  • gpu-repo-local-ollama-openclaw: Optional adjacent coverage for repo-current Docker startup and gateway health on the GPU/CDI runner. This is a special-runner scenario, so it is optional unless GPU-specific regressions are suspected.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Relevant changed files

  • Dockerfile
  • scripts/nemoclaw-start.sh

github-actions Bot (Contributor) commented Jun 8, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Built sandbox container reports Docker `Health.Status=healthy` when in-container curl returns connection refused, `/tmp/nemoclaw-gateway-local` exists, and `/tmp/nemoclaw-gateway.pid` points to a live plain-`openclaw` gateway. The branch-level shell tests are strong, but this is sandbox lifecycle behavior whose observable outcome depends on Docker HEALTHCHECK scheduling, actual process-table behavior, entrypoint PID recording, and respawn timing.
  • **Runtime validation** — Built sandbox container reports Docker `Health.Status=unhealthy` when `/tmp/gateway.log` is stale/non-empty, the recorded gateway PID is dead, and an unrelated `openclaw agent ...` process exists. The branch-level shell tests are strong, but this is sandbox lifecycle behavior whose observable outcome depends on Docker HEALTHCHECK scheduling, actual process-table behavior, entrypoint PID recording, and respawn timing.
  • **Runtime validation** — Respawned gateway updates `/tmp/nemoclaw-gateway.pid` and Docker health recovers after the new gateway process is launched. The branch-level shell tests are strong, but this is sandbox lifecycle behavior whose observable outcome depends on Docker HEALTHCHECK scheduling, actual process-table behavior, entrypoint PID recording, and respawn timing.
  • **Runtime validation** — If stronger PID identity is added later, a stale recorded PID reused by another plain-`openclaw` non-gateway process does not falsely keep Docker healthy. The branch-level shell tests are strong, but this is sandbox lifecycle behavior whose observable outcome depends on Docker HEALTHCHECK scheduling, actual process-table behavior, entrypoint PID recording, and respawn timing.
  • **Add runtime validation for the Docker HEALTHCHECK state** — Add or identify a built-container validation that checks Docker `Health.Status=healthy` when in-container curl gets connection refused but `/tmp/nemoclaw-gateway.pid` points to the live re-execed plain-`openclaw` gateway, and `Health.Status=unhealthy` when the gateway PID is dead while an unrelated `openclaw agent ...` process and stale/non-empty `/tmp/gateway.log` exist.
  • **Acceptance clause:** Issue [All Platforms][Sandbox] sandbox container HEALTHCHECK always unhealthy — gateway not on container loopback and pgrep fallback uses wrong argv pattern #4952: "The sandbox container HEALTHCHECK defined in `Dockerfile:943-954` always returns exit 1 (unhealthy), even when the gateway is fully functional." — add test evidence or identify existing coverage. The Dockerfile HEALTHCHECK now falls back to `/tmp/nemoclaw-gateway.pid` when curl returns rc=7 and the local-gateway marker exists, and `test/sandbox-provisioning.test.ts` expects status 0 for the plain-`openclaw` gateway case. The actual built-container Docker health status is not validated in changed files.
  • **Acceptance clause:** Issue [All Platforms][Sandbox] sandbox container HEALTHCHECK always unhealthy — gateway not on container loopback and pgrep fallback uses wrong argv pattern #4952: "Impact: container always shows `(unhealthy)` in `docker ps`. Any monitoring/alerting based on docker health (Prometheus exporter, dashboard, ops alerts) reports false-positive unhealthy state for fully-functional sandboxes." — add test evidence or identify existing coverage. The shell command now returns healthy for the simulated false-unhealthy branch, but there is no changed-file evidence that `docker ps` or a monitor observing Docker health now sees the built container as healthy.
  • **Acceptance clause:** Issue [All Platforms][Sandbox] sandbox container HEALTHCHECK always unhealthy — gateway not on container loopback and pgrep fallback uses wrong argv pattern #4952: "Expected Result — Step 3: `docker ps` shows `Up N minutes (healthy)`, `Health.Status` returns `healthy`." — add test evidence or identify existing coverage. The Dockerfile command returns status 0 in shell-snippet tests for the plain-argv gateway case, but the PR does not add or identify a built-container Docker `Health.Status=healthy` validation.
Since last review details

Current findings:

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

…ay PID (#4952)

Tighten the re-execed-gateway fallback added for #4952. A bare
`pgrep -x openclaw` only proves *some* process named `openclaw` exists, so a
marker-present container with a stale non-empty /tmp/gateway.log plus an
unrelated `openclaw` one-shot (e.g. `openclaw agent ...`) could keep Docker
healthy after the real gateway died, weakening restart/self-healing.

nemoclaw-start now records the live gateway PID in /tmp/nemoclaw-gateway.pid
(record_gateway_pid, written on both the root and non-root launch paths and
refreshed on every respawn). When the `openclaw[ -]gateway` pgrep pattern
misses, the HEALTHCHECK confirms THAT recorded PID is still a live `openclaw`
process via `ps -p <pid> -o comm=`, with a comm-prefix guard against PID
reuse. This is gateway-specific and no longer fooled by a non-gateway
`openclaw` process.

Tests:
- test/sandbox-provisioning.test.ts: drive the fallback with a recorded PID
  file + ps mock; add the reviewer's case (stray non-gateway `openclaw` +
  dead recorded PID -> unhealthy) and a PID-reuse case.
- test/gateway-pid-recording.test.ts (new, focused file): record_gateway_pid
  writes the PID file the HEALTHCHECK reads and never fails startup on a write
  error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>
jason-ma-nv (Contributor, Author) commented

Pushed a follow-up addressing the review feedback.

PR Review Advisor — "Tighten bare openclaw fallback" (the self-healing gap): fixed. The bare pgrep -x openclaw fallback is replaced with a gateway-specific check: nemoclaw-start now records the live gateway PID in /tmp/nemoclaw-gateway.pid (record_gateway_pid, written on both the root and non-root launch paths and refreshed on every respawn), and when the openclaw[ -]gateway pgrep pattern misses, the HEALTHCHECK confirms that recorded PID is still a live openclaw process via ps -p <pid> -o comm= (with a comm-prefix guard against PID reuse). A stale /tmp/gateway.log plus an unrelated openclaw one-shot can no longer keep the container green after the real gateway dies — restart/self-healing is preserved.

New/updated tests:

  • test/sandbox-provisioning.test.ts: the fallback is now exercised with a recorded PID file + ps mock. Added the exact case you flagged — stray non-gateway openclaw present + recorded gateway PID dead → unhealthy — plus a PID-reuse case.
  • test/gateway-pid-recording.test.ts (new focused file): proves record_gateway_pid writes the file the HEALTHCHECK reads and never aborts startup on a write error.
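
The best-effort semantics described for `record_gateway_pid` can be sketched as follows. This is an assumption-laden illustration, not the helper from scripts/nemoclaw-start.sh: the real function's body is not shown in this conversation, and the optional second argument (target path) is added here only so the behavior can be tried against a temporary file.

```shell
#!/bin/sh
# Sketch only: the real helper lives in scripts/nemoclaw-start.sh and may differ.
# The second argument (target path) is an assumption added for testability.
record_gateway_pid() {
  pid=$1
  pid_file=${2:-/tmp/nemoclaw-gateway.pid}
  # Group the write so a failed redirection is silenced as well, and never
  # propagate the failure: health bookkeeping must not block gateway startup.
  { printf '%s\n' "$pid" > "$pid_file"; } 2>/dev/null || true
}
```

Wrapping the write in a brace group matters: a plain `printf ... > file 2>/dev/null` would still print the shell's own redirection error, because the failing `>` is processed before `2>` takes effect.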

Linked Issues check (curl probe): the curl exit-7 path is handled by design — it falls back rather than failing — so the "always unhealthy" symptom was driven by the liveness fallback, which is what this PR corrects. Binding the gateway to loopback / probing its actual listener address is a separate concern best tracked on its own; happy to open a follow-up if maintainers prefer the broader change here.

E2E advisors: the suggested real-container assertion of .State.Health.Status under the #4952 shape is a good addition but lives outside this unit/shell-snippet scope — noting it for issue-2478-crash-loop-recovery-e2e / a dedicated production-HEALTHCHECK e2e.

coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (2)
Dockerfile (1)

964-979: Run the container-level E2E health assertions for this HEALTHCHECK change.

Given this is Docker HEALTHCHECK behavior, please validate with the recommended E2E jobs to confirm .State.Health.Status behavior in a real container runtime.

As per coding guidelines, Dockerfile-layer behavior is only fully testable with real container builds and should be validated via the listed nightly E2E jobs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile` around lines 964 - 979, Run container-level E2E validation for
the new HEALTHCHECK snippet: build and run the image and exercise the
HEALTHCHECK path that reads
NEMOCLAW_DASHBOARD_PORT/OPENCLAW_GATEWAY_PORT/CHAT_UI_URL, probes
http://127.0.0.1:${port}/health, and falls back to the local gateway checks
(/tmp/nemoclaw-gateway-local, /tmp/nemoclaw-gateway.pid, process matching
'openclaw gateway', and /tmp/gateway.log); use the recommended nightly E2E jobs
to confirm the container .State.Health.Status transitions for success, curl
connection refused (rc=7) and other failures, and document any runtime
differences from the Dockerfile logic.

Source: Coding guidelines

scripts/nemoclaw-start.sh (1)

3185-3594: Run the entrypoint-focused E2E suite for restart and recovery semantics.

This PID-recording path affects every sandbox boot and respawn loop; please validate with the recommended sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, and openclaw-slack-pairing-e2e jobs.

As per coding guidelines, scripts/nemoclaw-start.sh changes are not fully covered by unit tests and should be verified via the specified nightly E2E workflows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 3185 - 3594, Run the
entrypoint-focused E2E suites (sandbox-survival-e2e, sandbox-operations-e2e,
cloud-e2e, openclaw-slack-pairing-e2e) to validate the new non-root and root
PID/respawn behavior: exercise gateway start/stop/crash scenarios and confirm
record_gateway_pid writes the expected PID, SANDBOX_CHILD_PIDS and
SANDBOX_WAIT_PID are populated correctly after launches and respawns, the
respawn sliding-window logic (RESPAWN_TIMES/RESPAWN_COUNT) enforces the intended
throttling/alerts, the persistent log mirror started by
start_persistent_gateway_log_mirror captures /tmp/gateway.log into the durable
sandbox log, and validate_tmp_permissions/path ownership fixes
(fix_openclaw_ownership, provision_agent_workspaces,
seed_default_workspace_templates_as_sandbox) preserve permissions so restarts
and auto-pairing succeed; file any failures or flakes as regressions against
these functions for follow-up.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9807d797-5212-49f4-8643-dd3a7b8c6702

📥 Commits

Reviewing files that changed from the base of the PR and between 2684c22 and 4a4122c.

📒 Files selected for processing (4)
  • Dockerfile
  • scripts/nemoclaw-start.sh
  • test/gateway-pid-recording.test.ts
  • test/sandbox-provisioning.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/sandbox-provisioning.test.ts

…4952)

The runLaunchBlock harness in nemoclaw-start.test.ts extracts the gateway
launch block and runs it with its helper calls stubbed. The new
record_gateway_pid call added for #4952 was not stubbed, so the extracted
block hit an undefined command (exit 127) and failed the two launch-block
tests in CI. Stub it alongside cleanup_on_signal (kept on one line to stay
within the test-file-size budget) so the extracted block does not write the
host /tmp during the test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jason <jama@nvidia.com>
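
The exit-127 failure described in this commit can be reproduced in miniature. In this sketch, `run_block` is a hypothetical stand-in for the runLaunchBlock harness (which is TypeScript and not shown here), and the one-line `block` is an invented fragment that merely calls the two helpers named in the commit message.

```shell
#!/bin/sh
# Run an extracted script fragment in a child shell with stub definitions
# prepended. run_block is a hypothetical stand-in for the test harness.
run_block() {
  stubs=$1; block=$2
  sh -c "$stubs
$block" >/dev/null 2>&1
}

# Invented fragment that calls both helpers, as the launch block is said to do.
block='cleanup_on_signal; record_gateway_pid "$$"'

# Only one helper stubbed: the child shell exits 127 (command not found).
run_block 'cleanup_on_signal() { :; }' "$block"
echo "without stub: $?"

# Both helpers stubbed as no-ops: the fragment succeeds and writes nothing.
run_block 'cleanup_on_signal() { :; }; record_gateway_pid() { :; }' "$block"
echo "with stub: $?"
```

A no-op stub has the side benefit the commit mentions: the fragment never touches the host's /tmp while the test runs.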
wscurran added labels on Jun 8, 2026:

  • area: integrations (Third-party service integration behavior)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • bug-fix (PR fixes a bug or regression)
  • integration: openclaw (OpenClaw integration behavior)
wscurran (Contributor) commented Jun 8, 2026

Development

Successfully merging this pull request may close these issues.

[All Platforms][Sandbox] sandbox container HEALTHCHECK always unhealthy — gateway not on container loopback and pgrep fallback uses wrong argv pattern
