fix(sandbox): add opt-in fail-closed cap-drop enforcement#4707
Conversation
When a host does not grant CAP_SETPCAP (GitHub runners, Brev shadecloud, Colossus Ubuntu 24.04, Docker Desktop, WSL), drop_capabilities cannot run `capsh --drop`, so dangerous bounding-set caps survive and the agent boots with a weaker posture than the security model assumes. PR #4266 made this fall-through default fail-closed and was reverted within hours (#4341) because it broke every CAP_SETPCAP-less host. Reintroduce the protection as the inverse: the default stays warn-and-continue (zero regression — no host loses the ability to boot), and operators who require the hardened posture opt in with NEMOCLAW_REQUIRE_CAP_DROP=1 to refuse start when the bounding-set drop could not happen. Scope is the agent process tree only. The `nemoclaw connect` shell inherits the container's OCI bounding set and is not affected; tightening that needs cap_drop at sandbox-create, tracked upstream in NVIDIA/OpenShell#1452 — which is why this is Refs (not Closes) #3280. Tests drive both fall-through branches with a stubbed capsh (red before this change, green after), decode the QA-reported CapBnd fixture deterministically, and assert the default still warns-and-continues (no regression). Local pre-commit Test (CLI) hook skipped via --no-verify (operator-approved): it fails only on an offline-blocked messaging-bridge e2e test unrelated to this change; sandbox-init tests pass 32/32. CI re-runs all hooks on the PR. Refs #3280 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Dongni Yang <dongniy@nvidia.com>
📝 WalkthroughWalkthroughAdds opt-in fail-closed enforcement for Linux capability bounding-set drops: new decoder helper, enforce_cap_drop_if_required() which can refuse startup when ChangesCapability enforcement in sandbox initialization
Sequence Diagram(s)sequenceDiagram
participant Entrypoint as sandbox-init.sh
participant capsh as capsh (capability tool)
participant Enforce as enforce_cap_drop_if_required
participant ProcStatus as /proc/self/status (or NEMOCLAW_PROC_STATUS)
Entrypoint->>capsh: run `capsh --drop` built from DANGEROUS_CAPS
capsh-->>Entrypoint: success or failure (CAP_SETPCAP missing)
Entrypoint->>Enforce: enforce_cap_drop_if_required()
Enforce->>ProcStatus: read CapBnd
ProcStatus-->>Enforce: return CapBnd hex
Enforce->>Enforce: dangerous_caps_in_capbnd(parse CapBnd)
Enforce-->>Entrypoint: return OK or stderr refusal + exit(1)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/sandbox-init.test.ts (1)
463-487: ⚡ Quick winExercise
report_residual_capabilities()directly here.This test reimplements the decode loop instead of invoking the production helper, so it can still pass if
scripts/lib/sandbox-init.shchanges the inspected-cap list or output format. Stubbingawkand callingreport_residual_capabilitieswould keep the regression anchored to the real contract.💡 Higher-fidelity test shape
- "cap_bnd_hex=$(awk '/^CapBnd:/{print $2}' \"$TMP/status\")", - "val=$((16#$cap_bnd_hex))", - "present_caps=''", - "for entry in \\", - ' "21:cap_sys_admin" \\', - ' "19:cap_sys_ptrace" \\', - ' "13:cap_net_raw" \\', - ' "1:cap_dac_override" \\', - ' "10:cap_net_bind_service"; do', - ' bit="${entry%%:*}"', - ' name="${entry#*:}"', - " if [ $(((val >> bit) & 1)) -ne 0 ]; then", - ' present_caps="${present_caps:+$present_caps,}$name"', - " fi", - "done", - 'echo "RESIDUAL: ${present_caps}"', + "awk() { printf '00000004a82c35fb\\n'; }", + "report_residual_capabilities 2>&1",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/sandbox-init.test.ts` around lines 463 - 487, The test currently reimplements the capability decode loop; instead, stub the system helper (awk) to return the CapBnd fixture and invoke the production helper report_residual_capabilities from scripts/lib/sandbox-init.sh so the test exercises the real decoding contract; modify the test to export a TMP status file with "CapBnd: <hex>", set PATH or mock awk to return that field, call report_residual_capabilities and assert its output contains the same "RESIDUAL: ..." string, and remove the inline loop so future changes to report_residual_capabilities or the inspected-cap list will be caught by this test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 271-284: The refusal banner in enforce_cap_drop_if_required()
always blames "no CAP_SETPCAP" even when the real problem is missing capsh;
update the heredoc to use the passed-in reason variable (Reason: ${reason}) and
adjust the explanatory text to not hard-code "no CAP_SETPCAP" — instead mention
that the drop was skipped and include the specific cause (e.g., capsh
unavailable or lacking CAP_SETPCAP) so strict-mode users see the correct
diagnosis; ensure references to NEMOCLAW_REQUIRE_CAP_DROP and the
enforce_cap_drop_if_required function are preserved.
---
Nitpick comments:
In `@test/sandbox-init.test.ts`:
- Around line 463-487: The test currently reimplements the capability decode
loop; instead, stub the system helper (awk) to return the CapBnd fixture and
invoke the production helper report_residual_capabilities from
scripts/lib/sandbox-init.sh so the test exercises the real decoding contract;
modify the test to export a TMP status file with "CapBnd: <hex>", set PATH or
mock awk to return that field, call report_residual_capabilities and assert its
output contains the same "RESIDUAL: ..." string, and remove the inline loop so
future changes to report_residual_capabilities or the inspected-cap list will be
caught by this test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1c4c6b41-75b3-4e2b-b17e-f1a354c1e720
📒 Files selected for processing (2)
scripts/lib/sandbox-init.shtest/sandbox-init.test.ts
PR Review AdvisorFindings: 0 needs attention, 3 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Address review on #3280: - enforce_cap_drop_if_required verifies the actual bounding set on every path instead of trusting the NEMOCLAW_CAPS_DROPPED sentinel, closing a strict-mode bypass; the refusal reason is derived from verified state. - Extract dangerous_caps_in_capbnd() helper + NEMOCLAW_PROC_STATUS test seam; add sentinel-bypass and decode regression tests (35 pass). - Quote setpriv --reuid/--regid/--bounding-set array elements (SC2191). - Document NEMOCLAW_REQUIRE_CAP_DROP in best-practices and sandbox-hardening. Refs #3280 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4707.docs.buildwithfern.com/nemoclaw |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/security/best-practices.mdx (1)
297-297: ⚡ Quick winSplit Line 297 to one sentence per source line.
Line 297 currently packs multiple sentences into one line, which breaks the docs diff-readability rule.
Suggested edit
-To make the drop fail-closed instead of best-effort, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment. The agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities, so it will not boot on hosts that cannot perform the drop (no `CAP_SETPCAP`, or `capsh` missing). This is opt-in because such hosts are common (many cloud VMs, Docker Desktop, WSL); leaving it unset preserves the best-effort default. The check covers the agent process tree only — a `nemoclaw connect` shell is spawned by the container runtime outside that tree and is not affected (tracked in [NVIDIA/OpenShell#1452](https://github.com/NVIDIA/OpenShell/issues/1452)). +To make the drop fail-closed instead of best-effort, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment. +The agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities, so it does not boot on hosts that cannot perform the drop (no `CAP_SETPCAP`, or `capsh` missing). +This is opt-in because such hosts are common (many cloud VMs, Docker Desktop, WSL); leaving it unset preserves the best-effort default. +The check covers the agent process tree only — a `nemoclaw connect` shell is spawned by the container runtime outside that tree and is not affected (tracked in [NVIDIA/OpenShell#1452](https://github.com/NVIDIA/OpenShell/issues/1452)).As per coding guidelines, "One sentence per line in source (makes diffs readable)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/security/best-practices.mdx` at line 297, The paragraph containing the explanation about NEMOCLAW_REQUIRE_CAP_DROP should be split so each sentence is on its own source line to satisfy the "one sentence per line" docs rule; locate the paragraph that mentions NEMOCLAW_REQUIRE_CAP_DROP, the agent refusing to start unless the agent process tree's bounding set is verified free of dangerous capabilities, the examples of hosts where this fails (cloud VMs, Docker Desktop, WSL), and the note about the check covering only the agent process tree and the nemoclaw connect shell (linked NVIDIA/OpenShell#1452), then break that single long line into multiple lines so each independent sentence is its own line.docs/deployment/sandbox-hardening.mdx (1)
42-43: ⚡ Quick winReplace clause colons with sentence breaks.
Line 42 and Line 43 use colons as general punctuation between clauses; this page style reserves colons for introducing lists.
Suggested edit
-The bounding-set drop is best effort: if `capsh` or `CAP_SETPCAP` is unavailable the entrypoint logs a warning and continues with the runtime-provided capability set, and if `setpriv` is unavailable it falls back to `gosu`. -To make the drop fail-closed instead, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment: the agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities. +The bounding-set drop is best effort. +If `capsh` or `CAP_SETPCAP` is unavailable, the entrypoint logs a warning and continues with the runtime-provided capability set, and if `setpriv` is unavailable, it falls back to `gosu`. +To make the drop fail-closed instead, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment. +The agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities.As per coding guidelines, "Colons should only introduce a list. Flag colons used as general punctuation between clauses."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deployment/sandbox-hardening.mdx` around lines 42 - 43, The two sentences in the docs mention multiple clauses joined with colons; replace those colons with proper sentence breaks so colons are only used to introduce lists per style guidelines. Specifically, in the paragraph referencing capsh, CAP_SETPCAP, setpriv, gosu and the environment variable NEMOCLAW_REQUIRE_CAP_DROP, split the clauses into separate sentences (e.g., end the first clause after "runtime-provided capability set." and start a new sentence for the fallback to gosu; similarly end the first sentence about the bounding-set drop and start a new sentence for how to make it fail-closed) so the colons are removed and grammar reads as discrete sentences.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/sandbox-init.test.ts`:
- Around line 407-458: The test "warns but does NOT refuse to start when
CAP_SETPCAP is unavailable (issue `#3280`)" can inherit NEMOCLAW_REQUIRE_CAP_DROP
from the test runner because runWithLib merges process.env with the provided
opts.env; update the runWithLib invocation in this test to explicitly include
NEMOCLAW_REQUIRE_CAP_DROP: "" in the env override (same approach used by the
other strict-mode test) so the test runs in non-strict mode regardless of the
external environment.
---
Nitpick comments:
In `@docs/deployment/sandbox-hardening.mdx`:
- Around line 42-43: The two sentences in the docs mention multiple clauses
joined with colons; replace those colons with proper sentence breaks so colons
are only used to introduce lists per style guidelines. Specifically, in the
paragraph referencing capsh, CAP_SETPCAP, setpriv, gosu and the environment
variable NEMOCLAW_REQUIRE_CAP_DROP, split the clauses into separate sentences
(e.g., end the first clause after "runtime-provided capability set." and start a
new sentence for the fallback to gosu; similarly end the first sentence about
the bounding-set drop and start a new sentence for how to make it fail-closed)
so the colons are removed and grammar reads as discrete sentences.
In `@docs/security/best-practices.mdx`:
- Line 297: The paragraph containing the explanation about
NEMOCLAW_REQUIRE_CAP_DROP should be split so each sentence is on its own source
line to satisfy the "one sentence per line" docs rule; locate the paragraph that
mentions NEMOCLAW_REQUIRE_CAP_DROP, the agent refusing to start unless the agent
process tree's bounding set is verified free of dangerous capabilities, the
examples of hosts where this fails (cloud VMs, Docker Desktop, WSL), and the
note about the check covering only the agent process tree and the nemoclaw
connect shell (linked NVIDIA/OpenShell#1452), then break that single long line
into multiple lines so each independent sentence is its own line.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2114b6a2-7880-4500-8f1f-9671f8d3da0f
📒 Files selected for processing (4)
docs/deployment/sandbox-hardening.mdxdocs/security/best-practices.mdxscripts/lib/sandbox-init.shtest/sandbox-init.test.ts
- test: also unset NEMOCLAW_REQUIRE_CAP_DROP in the warn-and-continue case so it cannot inherit a strict-mode value from the runner env. - docs: wrap the NEMOCLAW_REQUIRE_CAP_DROP paragraphs one sentence per line in best-practices.mdx and sandbox-hardening.mdx (docs readability). Refs #3280 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Adversarial review found the strict gate verified only 5 of the 10 caps capsh drops, so a host retaining cap_setfcap/cap_mknod/cap_sys_chroot/ cap_fsetid/cap_audit_write passed NEMOCLAW_REQUIRE_CAP_DROP=1 as "clean". - Single DANGEROUS_CAPS list feeds both the capsh --drop arguments and dangerous_caps_in_capbnd, so the drop-set and verify-set cannot drift. - Reject a non-empty but malformed CapBnd (return nonzero) and treat it as unverifiable: refuse in strict mode, warn in default mode, instead of leaking a raw bash arithmetic error. - Tests: expect all 10 caps; add strict-refuse and default-warn cases for an unparseable CapBnd. - Docs: strict mode refuses based on the verified bounding set, not on whether the host can perform the drop. Refs #3280 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
|
✨ Thanks for submitting this detailed PR about adding an opt-in fail-closed guard to enforce cap-drop in the sandbox, which improves the security posture of the agent. This proposes a bug fix for the sandbox that affects the security model. Related open PRs: Related open issues: |
Summary
On hosts without CAP_SETPCAP (GitHub runners, Brev shadecloud, Colossus Ubuntu
24.04, Docker Desktop, WSL),
drop_capabilitiescannot runcapsh --drop, sodangerous bounding-set caps survive and the agent boots with a weaker posture
than the security model assumes. This adds an opt-in fail-closed guard
(
NEMOCLAW_REQUIRE_CAP_DROP=1) that verifies the agent process tree's actualbounding set and refuses to start if dangerous capabilities remain.
Related Issue
Refs #3280
Changes
scripts/lib/sandbox-init.sh:enforce_cap_drop_if_requiredruns on every path out ofdrop_capabilitiesand verifies the actualCapBndrather than trustingthe
NEMOCLAW_CAPS_DROPPEDsentinel (closes a strict-mode bypass).DANGEROUS_CAPSlist feeds both thecapsh --droparguments andthe verifier, so the drop-set and the verify-set (all 10 caps) cannot drift.
CapBndis treated as unverifiable: refuse instrict mode, warn in default mode — never a raw bash arithmetic error.
setpriv-based privilege step-down strips the remaining load-bearing capsatomically with the setuid transition (falls back to
gosuwhensetpriv/CAP_SETPCAPare unavailable).reverted fix(sandbox): refuse to start when bounding-set cap drop fails (#4264) #4266); operators set
NEMOCLAW_REQUIRE_CAP_DROP=1to refuse-to-start.test/sandbox-init.test.ts: repro + strict-mode + sentinel-bypass +malformed-
CapBndcoverage, seam-driven (NEMOCLAW_PROC_STATUS) anddeterministic.
docs/security/best-practices.mdx,docs/deployment/sandbox-hardening.mdx:document
NEMOCLAW_REQUIRE_CAP_DROPand the verified scope.Type of Change
Verification
npx prek run --all-filespassesSigned-off-by: Dongni Yang dongniy@nvidia.com
Summary by CodeRabbit
New Features
Behavior
Tests
Documentation