fix(sandbox): match re-execed plain-openclaw gateway argv in HEALTHCHECK (#4952)#4958
fix(sandbox): match re-execed plain-openclaw gateway argv in HEALTHCHECK (#4952)#4958jason-ma-nv wants to merge 3 commits into
Conversation
…ECK (#4952) Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway into a process whose argv is plain `openclaw` with no `gateway` token. On runtime shapes where the in-container curl probe fails (connection refused, exit 7) and the /tmp/nemoclaw-gateway-local marker is present, the gateway-liveness fallback used `pgrep -f 'openclaw[ -]gateway'`, which cannot see the re-execed process — so the container was reported permanently unhealthy even though the gateway was alive and serving. Add a `pgrep -x openclaw` fallback that matches the re-execed form by exact process name, mirroring the gateway_pid() helper in test/e2e/test-issue-2478-crash-loop-recovery.sh (gateway-token match first, bare-openclaw fallback second). Cover the regression in test/sandbox-provisioning.test.ts with a pgrep mock that matches its -f/-x pattern against a simulated process table, so the probe outcome depends on the real argv shape rather than a forced exit code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jason <jama@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughStart script now records gateway PID to /tmp/nemoclaw-gateway.pid; Dockerfile HEALTHCHECK first tries pgrep for gateway argv and falls back to verifying the recorded PID's command name when curl reports connection refused. Tests validate PID recording and the HEALTHCHECK probe across argv variants and failure modes. ChangesSandbox Healthcheck Process Detection Fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…ay PID (#4952) Tighten the re-execed-gateway fallback added for #4952. A bare `pgrep -x openclaw` only proves *some* process named `openclaw` exists, so a marker-present container with a stale non-empty /tmp/gateway.log plus an unrelated `openclaw` one-shot (e.g. `openclaw agent ...`) could keep Docker healthy after the real gateway died, weakening restart/self-healing. nemoclaw-start now records the live gateway PID in /tmp/nemoclaw-gateway.pid (record_gateway_pid, written on both the root and non-root launch paths and refreshed on every respawn). When the `openclaw[ -]gateway` pgrep pattern misses, the HEALTHCHECK confirms THAT recorded PID is still a live `openclaw` process via `ps -p <pid> -o comm=`, with a comm-prefix guard against PID reuse. This is gateway-specific and no longer fooled by a non-gateway `openclaw` process. Tests: - test/sandbox-provisioning.test.ts: drive the fallback with a recorded PID file + ps mock; add the reviewer's case (stray non-gateway `openclaw` + dead recorded PID -> unhealthy) and a PID-reuse case. - test/gateway-pid-recording.test.ts (new, focused file): record_gateway_pid writes the PID file the HEALTHCHECK reads and never fails startup on a write error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jason <jama@nvidia.com>
|
Pushed a follow-up addressing the review feedback. PR Review Advisor — "Tighten bare New/updated tests:
Linked Issues check (curl probe): the curl exit-7 path is handled by design — it falls back rather than failing — so the "always unhealthy" symptom was driven by the liveness fallback, which is what this PR corrects. Binding the gateway to loopback / probing its actual listener address is a separate concern best tracked on its own; happy to open a follow-up if maintainers prefer the broader change here. E2E advisors: the suggested real-container assertion of |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
Dockerfile (1)
964-979: Run the container-level E2E health assertions for this HEALTHCHECK change.Given this is Docker HEALTHCHECK behavior, please validate with the recommended E2E jobs to confirm
.State.Health.Statusbehavior in a real container runtime.As per coding guidelines, Dockerfile-layer behavior is only fully testable with real container builds and should be validated via the listed nightly E2E jobs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile` around lines 964 - 979, Run container-level E2E validation for the new HEALTHCHECK snippet: build and run the image and exercise the HEALTHCHECK path that reads NEMOCLAW_DASHBOARD_PORT/OPENCLAW_GATEWAY_PORT/CHAT_UI_URL, probes http://127.0.0.1:${port}/health, and falls back to the local gateway checks (/tmp/nemoclaw-gateway-local, /tmp/nemoclaw-gateway.pid, process matching 'openclaw gateway', and /tmp/gateway.log); use the recommended nightly E2E jobs to confirm the container .State.Health.Status transitions for success, curl connection refused (rc=7) and other failures, and document any runtime differences from the Dockerfile logic.Source: Coding guidelines
scripts/nemoclaw-start.sh (1)
3185-3594: Run the entrypoint-focused E2E suite for restart and recovery semantics.This PID-recording path affects every sandbox boot and respawn loop; please validate with the recommended
sandbox-survival-e2e,sandbox-operations-e2e,cloud-e2e, andopenclaw-slack-pairing-e2ejobs.As per coding guidelines,
scripts/nemoclaw-start.shchanges are not fully covered by unit tests and should be verified via the specified nightly E2E workflows.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 3185 - 3594, Run the entrypoint-focused E2E suites (sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e, openclaw-slack-pairing-e2e) to validate the new non-root and root PID/respawn behavior: exercise gateway start/stop/crash scenarios and confirm record_gateway_pid writes the expected PID, SANDBOX_CHILD_PIDS and SANDBOX_WAIT_PID are populated correctly after launches and respawns, the respawn sliding-window logic (RESPAWN_TIMES/RESPAWN_COUNT) enforces the intended throttling/alerts, the persistent log mirror started by start_persistent_gateway_log_mirror captures /tmp/gateway.log into the durable sandbox log, and validate_tmp_permissions/path ownership fixes (fix_openclaw_ownership, provision_agent_workspaces, seed_default_workspace_templates_as_sandbox) preserve permissions so restarts and auto-pairing succeed; file any failures or flakes as regressions against these functions for follow-up.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@Dockerfile`:
- Around line 964-979: Run container-level E2E validation for the new
HEALTHCHECK snippet: build and run the image and exercise the HEALTHCHECK path
that reads NEMOCLAW_DASHBOARD_PORT/OPENCLAW_GATEWAY_PORT/CHAT_UI_URL, probes
http://127.0.0.1:${port}/health, and falls back to the local gateway checks
(/tmp/nemoclaw-gateway-local, /tmp/nemoclaw-gateway.pid, process matching
'openclaw gateway', and /tmp/gateway.log); use the recommended nightly E2E jobs
to confirm the container .State.Health.Status transitions for success, curl
connection refused (rc=7) and other failures, and document any runtime
differences from the Dockerfile logic.
In `@scripts/nemoclaw-start.sh`:
- Around line 3185-3594: Run the entrypoint-focused E2E suites
(sandbox-survival-e2e, sandbox-operations-e2e, cloud-e2e,
openclaw-slack-pairing-e2e) to validate the new non-root and root PID/respawn
behavior: exercise gateway start/stop/crash scenarios and confirm
record_gateway_pid writes the expected PID, SANDBOX_CHILD_PIDS and
SANDBOX_WAIT_PID are populated correctly after launches and respawns, the
respawn sliding-window logic (RESPAWN_TIMES/RESPAWN_COUNT) enforces the intended
throttling/alerts, the persistent log mirror started by
start_persistent_gateway_log_mirror captures /tmp/gateway.log into the durable
sandbox log, and validate_tmp_permissions/path ownership fixes
(fix_openclaw_ownership, provision_agent_workspaces,
seed_default_workspace_templates_as_sandbox) preserve permissions so restarts
and auto-pairing succeed; file any failures or flakes as regressions against
these functions for follow-up.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9807d797-5212-49f4-8643-dd3a7b8c6702
📒 Files selected for processing (4)
Dockerfilescripts/nemoclaw-start.shtest/gateway-pid-recording.test.tstest/sandbox-provisioning.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/sandbox-provisioning.test.ts
…4952) The runLaunchBlock harness in nemoclaw-start.test.ts extracts the gateway launch block and runs it with its helper calls stubbed. The new record_gateway_pid call added for #4952 was not stubbed, so the extracted block hit an undefined command (exit 127) and failed the two launch-block tests in CI. Stub it alongside cleanup_on_signal (kept on one line to stay within the test-file-size budget) so the extracted block does not write the host /tmp during the test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: jason <jama@nvidia.com>
Summary
The sandbox container Docker HEALTHCHECK reported every NemoClaw sandbox as
(unhealthy)even when the gateway was alive and serving. Recent OpenClaw (v0.0.44 / 2026.5.18+) re-execs the long-running gateway into a process whose argv is plainopenclawwith nogatewaytoken, which the gateway-liveness fallback'spgrep -f 'openclaw[ -]gateway'could not match. This adds apgrep -x openclawfallback so the re-execed form is recognized.Related Issue
Fixes #4952
Changes
Dockerfile: the HEALTHCHECK gateway-liveness fallback now triespgrep --ignore-ancestors -f 'openclaw[ -]gateway'first and falls back topgrep --ignore-ancestors -x openclaw, matching the re-execed plain-openclawargv by exact process name. This mirrors the establishedgateway_pid()helper intest/e2e/test-issue-2478-crash-loop-recovery.sh(gateway-token match first, bare-openclawfallback second). The surrounding comment is updated to explain the new form.test/sandbox-provisioning.test.ts: new#4952describe block that drives apgrepmock which actually matches its-f/-xpattern against a simulated process table — so the probe outcome depends on the real argv shape rather than a forced exit code (which the existingrunProductionHealthProbeharness cannot exercise). Covers the plain-openclawargv, the launcheropenclaw gateway runform, the legacyopenclaw-gatewayform, and the no-process case.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: jama@nvidia.com jama@nvidia.com
Summary by CodeRabbit
Bug Fixes
New Features
Documentation
Tests