fix(onboard): pre-approve gateway scope upgrades after onboard and recover#4763
fix(onboard): pre-approve gateway scope upgrades after onboard and recover#4763TonyLuo-NV wants to merge 2 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds connect-time auto-pair budgeting constants, parameterizes and exports the sandbox approval pass, runs that pass (best-effort) from connect probe paths, wires a lazy finalization dependency to invoke it after sandbox recovery, and adds tests and a helper for probe/recover and gateway-down cases. ChangesAuto-pair scope approval integration
Sequence DiagramsequenceDiagram
participant Onboard as handleFinalizationState
participant Recovery as checkAndRecoverSandboxProcesses
participant Approval as autoPairScopeApproval
participant Verify as deploymentVerification
Onboard->>Recovery: recover sandbox processes
Recovery-->>Onboard: recovery complete
Onboard->>Approval: sweep pending scope upgrades (best-effort)
Approval-->>Onboard: approval complete (non-throwing)
Onboard->>Verify: verify deployment
Verify-->>Onboard: verification complete
Onboard->>Onboard: record session & print dashboard
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard/machine/handlers/finalization.ts (1)
105-108: Run the onboarding/sandbox E2E slice for this integration point.Since this hook changes finalization handoff behavior, run the recommended selective nightly E2E jobs before merge to validate cross-layer behavior under real gateway/sandbox lifecycle conditions.
As per coding guidelines:
src/lib/onboard.tschanges affect full sandbox creation/configuration flow and include explicit E2E test recommendations.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/machine/handlers/finalization.ts` around lines 105 - 108, This change to finalization (deps.autoPairScopeApproval in src/lib/onboard/machine/handlers/finalization.ts) alters the sandbox/gateway handoff path and must be validated by running the selective nightly E2E slice for onboarding/sandbox lifecycle; before merging, execute the recommended E2E jobs that cover the sandbox creation/configuration flow (as referenced by src/lib/onboard.ts), verify there are no regressions in gateway recovery, pairing handoff, or stuck pairing requests, and attach the E2E run IDs and any failing test traces to the PR for review.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 7032-7036: The autoPairScopeApproval callback wiring currently
defined as autoPairScopeApproval: (name) => { const connect: typeof
import("./actions/sandbox/connect") = require("./actions/sandbox/connect");
connect.runConnectAutoPairApprovalPass(name); } should be moved out of
src/lib/onboard.ts into a new module under src/lib/onboard/ (for example
src/lib/onboard/autoPair.ts) or into the file that implements the finalization
handler; export a function (e.g., initAutoPairScopeApproval) that registers the
callback there and import/invoke that initializer from the finalization handler
so onboard.ts remains growth-neutral while still calling
connect.runConnectAutoPairApprovalPass(name) via the extracted initializer.
---
Nitpick comments:
In `@src/lib/onboard/machine/handlers/finalization.ts`:
- Around line 105-108: This change to finalization (deps.autoPairScopeApproval
in src/lib/onboard/machine/handlers/finalization.ts) alters the sandbox/gateway
handoff path and must be validated by running the selective nightly E2E slice
for onboarding/sandbox lifecycle; before merging, execute the recommended E2E
jobs that cover the sandbox creation/configuration flow (as referenced by
src/lib/onboard.ts), verify there are no regressions in gateway recovery,
pairing handoff, or stuck pairing requests, and attach the E2E run IDs and any
failing test traces to the PR for review.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a2e6af1d-6a39-45c7-9c8b-c3ec6723a137
📒 Files selected for processing (6)
src/lib/actions/sandbox/connect-autopair-budget.tssrc/lib/actions/sandbox/connect.tssrc/lib/onboard.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tstest/sandbox-connect-inference.test.ts
7e5e79e to
6767c9b
Compare
|
✨ |
prekshivyas
left a comment
There was a problem hiding this comment.
Reviewed the logic and it's sound — the env-triplet strip and the budget invariant (2s + 10s×1 = 12s < 15s, extracted to the dep-free connect-autopair-budget leaf) both check out, and the first-run limitation is honestly documented. The blocker is that this now conflicts semantically with main, which independently reworked the same runConnectAutoPairApprovalPass path. Mapping the overlap so the rebase is targeted:
- Env-strip is already on main (now redundant here).
main'sscripts/lib/openclaw_device_approval_policy.pydefinesgateway_approval_env(), which popsGATEWAY_APPROVAL_ENV_KEYS = ("OPENCLAW_GATEWAY_URL", "OPENCLAW_GATEWAY_PORT", "OPENCLAW_GATEWAY_TOKEN")— the exact triplet this PR strips inline. So the#4462self-defeat fix is covered on main; the inlineapprove_env.pop(...)block can drop in favor ofgateway_approval_env(os.environ). runConnectAutoPairApprovalPasswas restructured on main — it now reads the externalopenclaw_device_approval_policy.pymodule and uses inlineMAX_APPROVALS=8/TIMEOUT_MS=12_000. That collides head-on with this PR's budget-leaf extraction andMAX_APPROVALS=1/15s. Worth a deliberate call on which budget wins (the1/15srationale here is reasonable, but it needs to sit on top of main's policy-module structure, not replace it).- The wiring is the part still genuinely missing on main.
finalization.tson main has zero references to the approval pass, so running it in onboard finalization and inrecoveris the unique, still-wanted contribution.
Suggested path: rebase on main, drop the now-redundant env-strip, reconcile the budget constants with main's structure, and keep the finalization/recover wiring (+ its tests). That should shrink the diff considerably. Happy to re-review once it's rebased.
…cover After a fresh onboard, the first `openclaw agent` run emitted "scope upgrade pending approval" and silently fell back to embedded mode, and the same pending requestId persisted across every subsequent run — the approval never auto-resolved. Root cause was a divergence in the host one-shot auto-pair approval pass: it ran `openclaw devices approve` with a 1s timeout and stripped only OPENCLAW_GATEWAY_URL from the child env. Leaving PORT and TOKEN set re-pinned the approve to the gateway, hitting the NVIDIA#4462 self-defeat where a gateway-pinned scope-upgrade approve requests the very scopes it is trying to grant and fails — so the request was never cleared. - Match the in-sandbox watcher: approve timeout 1s -> 10s, strip OPENCLAW_GATEWAY_URL + PORT + TOKEN. MAX_APPROVALS 8 -> 1 (single realistic pending upgrade) and outer cap 12s -> 15s so a slow 10s approve is never SIGKILLed mid-loop (2 + 10x1 = 12s < 15s). - Wire the now-correct pass into onboard finalization (after process recovery, before verification) and the `nemoclaw recover` / probe-only path (gateway-up branches only, not the failure exit), so `nemoclaw recover` deterministically clears a pending scope upgrade. The CLI scope-upgrade request is only created by the user's first `openclaw agent` run, so that first run can still fall back once; the deterministic remedy is `nemoclaw recover`. Closes NVIDIA#4504 Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6767c9b to
0c25fc1
Compare
|
@coderabbitai full review |
✅ Action performedFull review finished. |
|
@prekshivyas thanks — rebased onto main (
The earlier CodeRabbit |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/actions/sandbox/connect.ts (1)
767-798:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStale comment: update "OPENCLAW_GATEWAY_URL only" to reflect the full triplet fix.
Line 775 still says "removing OPENCLAW_GATEWAY_URL only from the child env" — this describes the OLD buggy behavior that caused
#4462. The PR summary states the fix strips all three env vars:OPENCLAW_GATEWAY_URL,OPENCLAW_GATEWAY_PORT, andOPENCLAW_GATEWAY_TOKEN. The test at line 386 ofauto-pair-approval.test.tsvalidates all three are stripped ("unset:unset:unset").📝 Proposed fix to update the comment
-// in-sandbox `openclaw devices list` invocation targets the running gateway -// with its token. Approvals then use OpenClaw's local fallback by removing -// OPENCLAW_GATEWAY_URL only from the child env, and apply the same allowlist +// in-sandbox `openclaw devices list` invocation targets the running gateway +// with its token. Approvals then use OpenClaw's local fallback by removing +// OPENCLAW_GATEWAY_URL, OPENCLAW_GATEWAY_PORT, and OPENCLAW_GATEWAY_TOKEN +// from the child env, and apply the same allowlist🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/connect.ts` around lines 767 - 798, Update the stale inline comment that currently says "removing OPENCLAW_GATEWAY_URL only from the child env" to reflect the real fix which strips the full triplet: OPENCLAW_GATEWAY_URL, OPENCLAW_GATEWAY_PORT, and OPENCLAW_GATEWAY_TOKEN; locate the comment block in src/lib/actions/sandbox/connect.ts around the one-shot / workaround description (the big comment starting above the approval logic and referenced by the auto-pair watcher behavior) and change that single sentence so it accurately states that all three env vars are removed from the child environment (matching the test in auto-pair-approval.test.ts that expects "unset:unset:unset").src/lib/onboard/finalization-deps.ts (1)
1-24:⚠️ Potential issue | 🔴 CriticalUpdate:
npm run typecheck:clicurrently fails (tsc -p tsconfig.cli.json).
The run exits with code 2 and reports many TypeScript errors (e.g., missing../dist/...modules and missing@oclif/coretype declarations), so it’s not possible to tell whethersrc/lib/onboard/finalization-deps.ts—or any suggested return-type change—compiles cleanly until the underlying CLI typecheck setup/inputs are fixed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/finalization-deps.ts` around lines 1 - 24, The CLI typecheck errors come from using typeof import("...") type annotations that force tsc to resolve non-CLI modules (causing missing ../dist and `@oclif` types); update finalizationHandlerDeps to avoid those module-resolving type annotations: remove the typeof import(...) types and instead cast the require() results to any (or annotate the functions as using any), e.g. in checkAndRecoverSandboxProcesses and autoPairScopeApproval replace the typed declarations around require("../actions/sandbox/process-recovery") and require("../actions/sandbox/connect") with an any-typed const and call the exported functions (checkAndRecoverSandboxProcesses, runConnectAutoPairApprovalPass) through that any-typed value so tsc no longer tries to resolve the referenced modules during the CLI typecheck.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/finalization-deps.ts`:
- Around line 10-14: The wrapper checkAndRecoverSandboxProcesses in
src/lib/onboard/finalization-deps.ts currently declares a void return and
discards the result from require("../actions/sandbox/process-recovery"); change
the wrapper to preserve and return the upstream function's structured result ({
checked, wasRunning, recovered, forwardRecovered }) by updating the wrapper
signature to return that same type and returning the value from
processRecovery.checkAndRecoverSandboxProcesses(name, options) so callers get
the original recovery object and API contract remains consistent with
src/lib/actions/sandbox/process-recovery.ts and usages such as
src/lib/actions/sandbox/connect.ts.
---
Outside diff comments:
In `@src/lib/actions/sandbox/connect.ts`:
- Around line 767-798: Update the stale inline comment that currently says
"removing OPENCLAW_GATEWAY_URL only from the child env" to reflect the real fix
which strips the full triplet: OPENCLAW_GATEWAY_URL, OPENCLAW_GATEWAY_PORT, and
OPENCLAW_GATEWAY_TOKEN; locate the comment block in
src/lib/actions/sandbox/connect.ts around the one-shot / workaround description
(the big comment starting above the approval logic and referenced by the
auto-pair watcher behavior) and change that single sentence so it accurately
states that all three env vars are removed from the child environment (matching
the test in auto-pair-approval.test.ts that expects "unset:unset:unset").
In `@src/lib/onboard/finalization-deps.ts`:
- Around line 1-24: The CLI typecheck errors come from using typeof
import("...") type annotations that force tsc to resolve non-CLI modules
(causing missing ../dist and `@oclif` types); update finalizationHandlerDeps to
avoid those module-resolving type annotations: remove the typeof import(...)
types and instead cast the require() results to any (or annotate the functions
as using any), e.g. in checkAndRecoverSandboxProcesses and autoPairScopeApproval
replace the typed declarations around
require("../actions/sandbox/process-recovery") and
require("../actions/sandbox/connect") with an any-typed const and call the
exported functions (checkAndRecoverSandboxProcesses,
runConnectAutoPairApprovalPass) through that any-typed value so tsc no longer
tries to resolve the referenced modules during the CLI typecheck.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8d163a73-5e2d-462e-bdfb-d0ba9e81c8cc
📒 Files selected for processing (8)
src/lib/actions/sandbox/connect-autopair-budget.tssrc/lib/actions/sandbox/connect.tssrc/lib/onboard.tssrc/lib/onboard/finalization-deps.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tstest/sandbox-connect-inference/auto-pair-approval.test.tstest/sandbox-connect-inference/helpers.ts
| checkAndRecoverSandboxProcesses(name: string, options: { quiet: boolean }): void { | ||
| const processRecovery: typeof import("../actions/sandbox/process-recovery") = | ||
| require("../actions/sandbox/process-recovery"); | ||
| processRecovery.checkAndRecoverSandboxProcesses(name, options); | ||
| }, |
There was a problem hiding this comment.
Return type mismatch breaks the upstream contract.
The upstream checkAndRecoverSandboxProcesses function returns a structured object with { checked, wasRunning, recovered, forwardRecovered } fields (see src/lib/actions/sandbox/process-recovery.ts:421-545), but this wrapper changes the return type to void. This contract break prevents the finalization handler from:
- Knowing whether the sandbox exists (
checked: false) - Determining if recovery succeeded or failed
- Inspecting whether the port forward was restored
- Making decisions or logging based on recovery outcomes
In src/lib/actions/sandbox/connect.ts:203-241, the same function's return value is actively used to control probe behavior and user messaging. Even if the current finalization handler doesn't need this information, preserving the return type:
- Maintains API consistency with the upstream function
- Enables future code to inspect recovery outcomes without modifying the wrapper
- Provides valuable debugging information for onboard failures
🔧 Proposed fix to preserve the upstream return type
- checkAndRecoverSandboxProcesses(name: string, options: { quiet: boolean }): void {
+ checkAndRecoverSandboxProcesses(
+ name: string,
+ options: { quiet: boolean },
+ ): { checked: boolean; wasRunning: boolean | null; recovered: boolean; forwardRecovered: boolean } {
const processRecovery: typeof import("../actions/sandbox/process-recovery") =
require("../actions/sandbox/process-recovery");
- processRecovery.checkAndRecoverSandboxProcesses(name, options);
+ return processRecovery.checkAndRecoverSandboxProcesses(name, options);
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| checkAndRecoverSandboxProcesses(name: string, options: { quiet: boolean }): void { | |
| const processRecovery: typeof import("../actions/sandbox/process-recovery") = | |
| require("../actions/sandbox/process-recovery"); | |
| processRecovery.checkAndRecoverSandboxProcesses(name, options); | |
| }, | |
| checkAndRecoverSandboxProcesses( | |
| name: string, | |
| options: { quiet: boolean }, | |
| ): { checked: boolean; wasRunning: boolean | null; recovered: boolean; forwardRecovered: boolean } { | |
| const processRecovery: typeof import("../actions/sandbox/process-recovery") = | |
| require("../actions/sandbox/process-recovery"); | |
| return processRecovery.checkAndRecoverSandboxProcesses(name, options); | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard/finalization-deps.ts` around lines 10 - 14, The wrapper
checkAndRecoverSandboxProcesses in src/lib/onboard/finalization-deps.ts
currently declares a void return and discards the result from
require("../actions/sandbox/process-recovery"); change the wrapper to preserve
and return the upstream function's structured result ({ checked, wasRunning,
recovered, forwardRecovered }) by updating the wrapper signature to return that
same type and returning the value from
processRecovery.checkAndRecoverSandboxProcesses(name, options) so callers get
the original recovery object and API contract remains consistent with
src/lib/actions/sandbox/process-recovery.ts and usages such as
src/lib/actions/sandbox/connect.ts.
The auto-pair approval-pass comment still described the old behavior (removing OPENCLAW_GATEWAY_URL only). Update it to reflect the shared gateway_approval_env() policy helper, which strips URL + PORT + TOKEN — the full triplet that avoids the NVIDIA#4462 self-defeat. Comment-only. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai addressed your review (
|
|
Tip For best results, initiate chat on the files or code changes.
Everything looks good. 🐇 |
Summary
Fixes #4504. After a fresh
nemoclaw onboard, the firstopenclaw agentrun printedgateway connect failed: scope upgrade pending approvaland silently fell back to embedded mode (exit 0), and the same pendingrequestIdsurvived across every subsequent run — the approval was never auto-resolved (regression of NVB#6041401).Root cause
Two defects in the connect-time auto-pair approval pass (
runConnectAutoPairApprovalPass):openclaw devices approveran withtimeout=1sand stripped onlyOPENCLAW_GATEWAY_URLfrom the child env. LeavingOPENCLAW_GATEWAY_PORT+OPENCLAW_GATEWAY_TOKENset re-pinned the approve to the gateway, hitting the OpenClaw CLI scope-upgrade approval deadlocks and forces openclaw agent into embedded fallback #4462 self-defeat where a gateway-pinned scope-upgrade approve requests the very scopes it is trying to grant and fails. So the request was never cleared and the samerequestIdpersisted.connectpath, not onboard finalization ornemoclaw recover.Fix
timeout1s → 10s; stripOPENCLAW_GATEWAY_URL+PORT+TOKEN.MAX_APPROVALS8 → 1 (single realistic pending upgrade); outer cap 12s → 15s so a slow 10s approve is never SIGKILLed mid-loop (2 + 10×1 = 12s < 15s, ~3s startup slack).nemoclaw recover/ probe-only path (gateway-up branches only, not the failure exit).connect-autopair-budget.tsleaf so the invariant is asserted on real values, not source text.Known limitation
The CLI scope-upgrade request is only created by the user's first
openclaw agentrun (after onboard returns), so that first run can still fall back once — onboard has nothing to pre-approve yet. The deterministic remedy isnemoclaw recover(now actually clears it) or one ~30s in-sandbox watcher cycle. A docs/changelog note pointing users atnemoclaw recoveron first-run fallback is recommended as follow-up.Testing
test/sandbox-connect-inference.test.ts(5 new: probe-path positive/negative/best-effort, env-triplet strip, budget invariant) andsrc/lib/onboard/machine/handlers/finalization.test.ts(ordering, agent-agnostic, best-effort) — all pass.npm run typecheck:cliclean.🤖 Generated with Claude Code
Signed-off-by: Tony Luo xialuo@nvidia.com
Summary by CodeRabbit
New Features
Improvements
Tests