Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/deployment/sandbox-hardening.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,11 @@ It removes `CAP_SYS_ADMIN`, `CAP_SYS_PTRACE`, `CAP_NET_RAW`,
`CAP_MKNOD`, `CAP_AUDIT_WRITE`, and `CAP_NET_BIND_SERVICE`.
When `setpriv` is available, the entrypoint also removes the remaining privilege-separation capabilities during the switch from root to the `sandbox` and `gateway` users.

The bounding-set drop is best effort: if `capsh` or `CAP_SETPCAP` is unavailable the entrypoint logs a warning and continues with the runtime-provided capability set.
If `setpriv` is unavailable, the entrypoint falls back to `gosu`.
To make the drop fail-closed instead, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment: the agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities.
This is opt-in because hosts that cannot drop capabilities (no `CAP_SETPCAP` — many cloud VMs, Docker Desktop, WSL) are common, and the check covers the agent process tree only.

For defense-in-depth, also drop all Linux capabilities at the container runtime
when you launch the image directly:

Expand Down
6 changes: 6 additions & 0 deletions docs/security/best-practices.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -293,6 +293,12 @@ During `setpriv` step-down, the child process also loses `cap_setuid`, `cap_setg

This behavior is best effort: if `capsh` is not available or `CAP_SETPCAP` is not in the bounding set, the entrypoint logs a warning and continues with the default capability set.
If `setpriv` is unavailable, the entrypoint falls back to `gosu` and logs a warning that the remaining bounding-set capabilities were retained for the child process.

To make the drop fail-closed instead of best-effort, set `NEMOCLAW_REQUIRE_CAP_DROP=1` in the entrypoint environment.
The agent then refuses to start unless the agent process tree's bounding set is verified free of the dangerous capabilities, so it will not boot on a host whose bounding set still holds them — typically one that cannot perform the drop (no `CAP_SETPCAP`, or `capsh` missing) and was not given a clean bounding set by the container runtime.
This is opt-in because such hosts are common (many cloud VMs, Docker Desktop, WSL); leaving it unset preserves the best-effort default.
The check covers the agent process tree only — a `nemoclaw connect` shell is spawned by the container runtime outside that tree and is not affected (tracked in [NVIDIA/OpenShell#1452](https://github.com/NVIDIA/OpenShell/issues/1452)).

For additional protection, pass `--cap-drop=ALL` with `docker run` or Compose (see [Sandbox Hardening](../manage-sandboxes/sandbox-hardening)).

| Aspect | Detail |
Expand Down
162 changes: 135 additions & 27 deletions scripts/lib/sandbox-init.sh
Original file line number Diff line number Diff line change
Expand Up @@ -214,12 +214,44 @@ lock_config_after_write() {
# cap_kill — sandbox user signals gateway-user processes via the UID
# separation enforced by the entrypoint (see test 13 in
# e2e-gateway-isolation.sh).
# When the runtime cannot drop the bounding set (no CAP_SETPCAP, or capsh
# missing), the default is to warn and continue. Set NEMOCLAW_REQUIRE_CAP_DROP=1
# to make that case fail-closed instead — see enforce_cap_drop_if_required.
#
# Ref: https://github.com/NVIDIA/NemoClaw/issues/797
# https://github.com/NVIDIA/NemoClaw/issues/3280
# https://github.com/NVIDIA/OpenShell/issues/1452 (connect-shell scope)
#
# Usage:
# drop_capabilities /usr/local/bin/nemoclaw-start "$@"
#
# Single source of truth for the dangerous capabilities the entrypoint drops
# and (in strict mode) verifies are gone. "bit:name" pairs; bit numbers per
# /usr/include/linux/capability.h. Both the capsh --drop list and
# dangerous_caps_in_capbnd() derive from this array, so the drop-set and the
# strict-mode verify-set cannot drift apart (issue #3280).
DANGEROUS_CAPS=(
"21:cap_sys_admin"
"19:cap_sys_ptrace"
"13:cap_net_raw"
"1:cap_dac_override"
"18:cap_sys_chroot"
"4:cap_fsetid"
"31:cap_setfcap"
"27:cap_mknod"
"29:cap_audit_write"
"10:cap_net_bind_service"
)

# Comma-separated capability names for `capsh --drop`, derived from DANGEROUS_CAPS.
dangerous_caps_drop_list() {
local entry out=""
for entry in "${DANGEROUS_CAPS[@]}"; do
out="${out:+$out,}${entry#*:}"
done
printf '%s' "$out"
}

# The first argument is the absolute path to the entrypoint script to
# re-exec via capsh. Remaining arguments are forwarded.
drop_capabilities() {
Expand All @@ -232,14 +264,102 @@ drop_capabilities() {
if capsh --has-p=cap_setpcap 2>/dev/null; then
export NEMOCLAW_CAPS_DROPPED=1
exec capsh \
--drop=cap_sys_admin,cap_sys_ptrace,cap_net_raw,cap_dac_override,cap_sys_chroot,cap_fsetid,cap_setfcap,cap_mknod,cap_audit_write,cap_net_bind_service \
--drop="$(dangerous_caps_drop_list)" \
-- -c "exec $entrypoint \"\$@\"" -- "$@"
else
report_residual_capabilities
fi
# CAP_SETPCAP missing (or the exec above failed): the drop could not run.
# Surface the residual bounding-set caps in the log.
report_residual_capabilities || true
elif [ "${NEMOCLAW_CAPS_DROPPED:-}" != "1" ]; then
echo "[SECURITY WARNING] capsh not available — running with default capabilities" >&2
fi

# Opt-in fail-closed gate (issue #3280). Deliberately runs on EVERY path,
# including when NEMOCLAW_CAPS_DROPPED is already set: it verifies the actual
# bounding set rather than trusting that sentinel, so an inherited marker
# cannot mask a drop that never happened.
enforce_cap_drop_if_required
}

# Pure decode: given a CapBnd hex string, echo the comma-separated list of the
# dangerous capabilities present (empty string if none). Factored out so the
# residual diagnostic, the strict-mode gate, and the unit tests all share one
# implementation instead of re-deriving the bit math.
#
# Bash arithmetic handles 64-bit ints on 64-bit platforms; CAP_LAST_CAP is ~41
# today, well within range. Avoids a gawk-strtonum dependency.
#
# Returns nonzero with no output if the hex is empty or malformed, so callers
# can treat "could not parse" the same as "could not read" instead of silently
# treating an unparseable bounding set as clean (issue #3280).
dangerous_caps_in_capbnd() {
local cap_bnd_hex="$1" val entry bit name present=""
case "$cap_bnd_hex" in
"" | *[!0-9A-Fa-f]*) return 1 ;;
esac
val=$((16#$cap_bnd_hex))
for entry in "${DANGEROUS_CAPS[@]}"; do
bit="${entry%%:*}"
name="${entry#*:}"
if [ $(((val >> bit) & 1)) -ne 0 ]; then
present="${present:+$present,}$name"
fi
done
printf '%s' "$present"
}

# Opt-in fail-closed enforcement (issue #3280). When NEMOCLAW_REQUIRE_CAP_DROP=1
# the sandbox refuses to start unless the bounding set is provably free of the
# dangerous capabilities. It verifies by reading the ACTUAL CapBnd — NOT by
# trusting the NEMOCLAW_CAPS_DROPPED sentinel, which an inherited environment
# could forge to bypass the gate.
#
# DEFAULT (unset) IS WARN-AND-CONTINUE — no host loses the ability to boot. This
# is the lesson of #4266/#4341: a default-fail-closed drop broke EVERY host that
# does not grant CAP_SETPCAP (GitHub runners, Brev shadecloud, Colossus Ubuntu
# 24.04, Docker Desktop, WSL) and was reverted within hours. Inverting the
# default to opt-in keeps that regression off by default.
#
# Scope: the AGENT process tree only. A `nemoclaw connect` shell is spawned by
# the container runtime outside that tree and inherits the container's OCI
# bounding set; tightening that requires cap_drop at sandbox create, tracked
# upstream in NVIDIA/OpenShell#1452.
#
# Test seam: NEMOCLAW_PROC_STATUS overrides the status source so unit tests can
# feed a known CapBnd fixture without a real /proc.
enforce_cap_drop_if_required() {
[ "${NEMOCLAW_REQUIRE_CAP_DROP:-}" = "1" ] || return 0

local status_path="${NEMOCLAW_PROC_STATUS:-/proc/self/status}"
local cap_bnd_hex present reason=""
cap_bnd_hex=$(awk '/^CapBnd:/{print $2}' "$status_path" 2>/dev/null || true)
if [ -z "$cap_bnd_hex" ]; then
# Cannot verify → in strict mode, refuse rather than assume safety.
reason="could not read bounding set from ${status_path} — cannot verify drop"
elif ! present="$(dangerous_caps_in_capbnd "$cap_bnd_hex")"; then
# Non-empty but unparseable CapBnd is equally unverifiable → refuse.
reason="could not parse bounding set (CapBnd=${cap_bnd_hex}) — cannot verify drop"
elif [ -n "$present" ]; then
reason="dangerous caps remain in bounding set (CapBnd=${cap_bnd_hex}): ${present}"
fi
[ -n "$reason" ] || return 0

cat >&2 <<'EOF'

┌─ [SECURITY] Refusing to start sandbox: bounding-set capability drop failed ──
│ NEMOCLAW_REQUIRE_CAP_DROP=1 is set, so NemoClaw refuses to start a sandbox
│ that still holds dangerous bounding-set capabilities. The runtime could not
│ drop them (capsh or CAP_SETPCAP unavailable on this host), so they remain.
│ To run anyway with the weaker (warn-only) posture, unset the variable:
│ unset NEMOCLAW_REQUIRE_CAP_DROP
│ Tracking: https://github.com/NVIDIA/NemoClaw/issues/3280
└──────────────────────────────────────────────────────────────────────────────
EOF
echo "[SECURITY] ${reason}" >&2
exit 1
}

# Emit a loud diagnostic when capsh-based dropping is unavailable so that
Expand All @@ -249,31 +369,19 @@ drop_capabilities() {
report_residual_capabilities() {
echo "[SECURITY] CAP_SETPCAP not available — cannot drop bounding-set caps via capsh" >&2

local cap_bnd_hex val name bit present_caps=""
if ! cap_bnd_hex=$(awk '/^CapBnd:/{print $2}' /proc/self/status 2>/dev/null) \
local status_path="${NEMOCLAW_PROC_STATUS:-/proc/self/status}"
local cap_bnd_hex present
if ! cap_bnd_hex=$(awk '/^CapBnd:/{print $2}' "$status_path" 2>/dev/null) \
|| [ -z "$cap_bnd_hex" ]; then
echo "[SECURITY] Could not read /proc/self/status — residual caps unknown" >&2
echo "[SECURITY] Could not read ${status_path} — residual caps unknown" >&2
return 0
fi
echo "[SECURITY] Residual CapBnd=${cap_bnd_hex}" >&2

# Bash arithmetic handles 64-bit ints on 64-bit platforms; CAP_LAST_CAP
# is ~41 today, well within range. Avoids a gawk-strtonum dependency.
val=$((16#$cap_bnd_hex))
for entry in \
"21:cap_sys_admin" \
"19:cap_sys_ptrace" \
"13:cap_net_raw" \
"1:cap_dac_override" \
"10:cap_net_bind_service"; do
bit="${entry%%:*}"
name="${entry#*:}"
if [ $(((val >> bit) & 1)) -ne 0 ]; then
present_caps="${present_caps:+$present_caps,}$name"
fi
done
if [ -n "$present_caps" ]; then
echo "[SECURITY] Dangerous caps remain in bounding set: ${present_caps}" >&2
if ! present="$(dangerous_caps_in_capbnd "$cap_bnd_hex")"; then
echo "[SECURITY] Could not parse CapBnd=${cap_bnd_hex} — residual caps unknown" >&2
elif [ -n "$present" ]; then
echo "[SECURITY] Dangerous caps remain in bounding set: ${present}" >&2
fi
}

Expand Down Expand Up @@ -331,13 +439,13 @@ init_step_down_prefixes() {
local drop="-setuid,-setgid,-fowner,-chown,-kill"
# shellcheck disable=SC2034 # consumed by entrypoint scripts (cross-file)
STEP_DOWN_PREFIX_SANDBOX=(
setpriv --reuid=sandbox --regid=sandbox --init-groups
--bounding-set="$drop" --
setpriv "--reuid=sandbox" "--regid=sandbox" --init-groups
"--bounding-set=$drop" --
)
# shellcheck disable=SC2034 # consumed by entrypoint scripts (cross-file)
STEP_DOWN_PREFIX_GATEWAY=(
setpriv --reuid=gateway --regid=gateway --init-groups
--bounding-set="$drop" --
setpriv "--reuid=gateway" "--regid=gateway" --init-groups
"--bounding-set=$drop" --
)
else
echo "[SECURITY WARNING] setpriv or CAP_SETPCAP unavailable — falling back to gosu (bounding set will retain cap_setuid/setgid/fowner/chown/kill — issue #3280)" >&2
Expand Down
Loading
Loading