Skip to content

fix(device-connect): security hardening for the Device Connect integration#2

Merged
kavya-chennoju merged 1 commit into
kavya-chennoju:feat/device-connect-on-mainfrom
cagataycali:feat/device-connect-security-hardening
Jun 12, 2026
Merged

fix(device-connect): security hardening for the Device Connect integration#2
kavya-chennoju merged 1 commit into
kavya-chennoju:feat/device-connect-on-mainfrom
cagataycali:feat/device-connect-security-hardening

Conversation

@cagataycali

Copy link
Copy Markdown

Defence-in-depth hardening for the Device Connect integration across the agent dispatch path, the device drivers, and the Reachy transport. All changes preserve existing behaviour on the trusted-path defaults.

Depends on arm/device-connect#54 (get_rpc_source_device() hook), which the driver-side caller authorization uses.

Findings addressed

# Finding Fix
01 Broadcast re-parses raw caller command (TOCTOU — executed action could differ from approved one) Dispatch the validated, operator-approved command through Device Connect; no raw json.loads in the broadcast branch
02 policy_provider SSRF — inference could be pointed at an arbitrary endpoint Validate policy_provider against the existing allowlist in both robot_driver.execute and sim_driver.execute before start
03 Device-native rpc not HITL-gated Route rpc through the human-in-the-loop approval gate, surfacing the function name
04 Path traversal / query injection in Reachy playMove move_name Strict charset validation before interpolating into the daemon REST path
05 Unauthenticated Reachy daemon (REST + WebSocket) Optional bearer token (REACHY_DAEMON_TOKEN); loud warning when the link is unauthenticated
06 No RPC authz + spoofable e-stop Per-call caller ACL on state-mutating RPCs (execute/stop/step/reset) + emergencyStop source allowlist. Env-driven (DEVICE_CONNECT_RPC_ALLOW / DEVICE_CONNECT_ESTOP_ALLOW); permissive-with-warning when unset, fail-closed when set
07 Insecure-by-default transport allow_insecure defaults to False (opt-in via arg or DEVICE_CONNECT_ALLOW_INSECURE); warning logged when active; removed the agent-side setdefault that forced insecure mode process-wide. GUIDE.md updated to document secure-by-default + explicit opt-in

Changes

  • tools/robot_mesh.py: validated-command plumbing for broadcast; rpc HITL gate
  • device_connect/robot_driver.py, sim_driver.py: policy_provider allowlist + caller ACL + e-stop source check
  • device_connect/reachy_mini_driver.py, reachy_transport.py: move_name validation + bearer-token auth
  • device_connect/__init__.py: secure-by-default transport; removed forced insecure setdefault
  • device_connect/_authz.py: new ACL helper (env-driven, fail-closed)
  • device_connect/GUIDE.md: secure-by-default docs + insecure opt-in for local D2D
  • .github/workflows/test-lint.yml: pin device-connect-edge/agent-tools to source via UV_OVERRIDE when a PR touches the DC integration
  • tests/test_device_connect_hardening.py: covers all of the above

Tests

```
1142 passed — drivers + hardening + all_robots + mesh security + mesh tool (CI collection order)
54 passed — device-connect-edge framework (incl. new get_rpc_source_device hook tests)
```

Note for @kavya-chennoju

There's a pre-existing test-isolation fragility in tests/test_device_connect_all_robots.py (mocks device_connect_edge in sys.modules but not device_connect_edge.messaging; fails when it runs before test_device_connect_drivers.py). Verified by stashing all of this PR's work — pristine branch fails the identical 6 tests. This PR introduces zero new failures. Flagging it as a good follow-up cleanup, but it's out of scope here.

…ation

Harden the Device Connect integration across the agent dispatch path,
the device drivers, and the Reachy transport. All changes are
defence-in-depth and preserve existing behaviour on the trusted-path
defaults.

- robot_mesh broadcast: dispatch the validated, operator-approved
  command through Device Connect instead of re-parsing the raw caller
  string, so the executed action cannot differ from the approved one.
- robot_mesh rpc: route device-native rpc through the human-in-the-loop
  approval gate (surfacing the function name) so device-native
  functions are not invoked without explicit approval.
- drivers: validate policy_provider against the existing allowlist in
  both robot and sim execute() before starting a policy, preventing
  inference from being pointed at an arbitrary endpoint.
- drivers: add per-call caller authorization on state-mutating RPCs
  (execute/stop/step/reset) and validate the source of emergencyStop
  against an allowlist. Env-driven (DEVICE_CONNECT_RPC_ALLOW /
  DEVICE_CONNECT_ESTOP_ALLOW); permissive with a warning when unset,
  fail-closed when set. Uses get_rpc_source_device() from
  device-connect-edge.
- reachy: validate playMove move_name against a strict charset before
  interpolating it into the daemon REST path (prevents path traversal /
  query injection).
- reachy: support an optional bearer token (REACHY_DAEMON_TOKEN) on the
  daemon WebSocket and REST interfaces and warn loudly when the link is
  unauthenticated.
- transport: make Device Connect secure by default. allow_insecure now
  defaults to False and must be opted into explicitly (arg or
  DEVICE_CONNECT_ALLOW_INSECURE); a warning is logged whenever insecure
  mode is active. Removed the agent-side setdefault that forced insecure
  mode process-wide.
- GUIDE.md: document secure-by-default and the explicit insecure opt-in
  for local D2D trials.
- CI: when a PR touches the Device Connect integration, pin
  device-connect-edge / agent-tools to source via UV_OVERRIDE so the
  matching framework version is exercised.
- tests: add tests/test_device_connect_hardening.py covering all of the
  above.
@kavya-chennoju

kavya-chennoju commented Jun 12, 2026

Copy link
Copy Markdown
Owner

@cagataycali This is awesome, thanks for the help.

follow-up fix for the caller-authorization work, ready to pull into this PR.

What was wrong

A live E2E (real Robot('so100') sim driven over Zenoh via robot_mesh) showed the per-call caller ACL is all-or-nothing on the agent path. device-connect-agent-tools clients are anonymous, so the device always sees caller=None. With DEVICE_CONNECT_RPC_ALLOW set, every agent call is denied and there's no reachable allow-path:

# agent → so100 (DEVICE_CONNECT_RPC_ALLOW=trusted-controller)
{"status":"error","reason":"caller not authorized for 'execute'","caller":"unknown"}
robot log: Rejected unauthorized Device Connect RPC execute from caller=None

The unit tests passed source_device= straight into the driver, masking this — that state isn't reachable through the documented agent flow.

The fix (3 layers)

  • Layer 1 — identity propagation (robot_mesh.py): stamp _dc_meta.source_device into the DC command envelope from STRANDS_ROBOT_MESH_AGENT_ID / DEVICE_CONNECT_CLIENT_ID. Anonymous callers stay fail-closed.
  • Layer 2 — honest authz (_authz.py): document that the id is only a hard boundary under authenticated transport; log a one-time advisory when an allowlist is enforced under DEVICE_CONNECT_ALLOW_INSECURE (self-asserted id).
  • Layer 3 — docs + tests: extract resolve_allow_insecure() + fix the stale "defaults to True" docstring; fix the contradictory DEVICE_CONNECT_ALLOW_INSECURE=true in the GUIDE's mTLS section; replace the tautological test_allow_insecure_defaults_false; add coverage for the reachable anonymous-deny, the insecure advisory, and _dc_meta propagation. 21 → 29 tests.

Re-run on the live sim after the fix: anonymous → denied; STRANDS_ROBOT_MESH_AGENT_ID=trusted-controllersuccess.

How to pull it in

The commit (f76d21d) is based directly on this PR's current head (fd66308), so it's a clean fast-forward. From your local checkout on this branch:

git fetch https://github.com/kavya-chennoju/robots feat/device-connect-security-hardening
git merge --ff-only FETCH_HEAD
git push origin feat/device-connect-security-hardening

(or git cherry-pick f76d21d if you'd rather take just the one commit). It's also open as #3 against feat/device-connect-on-main if you'd prefer to review/land it there.

Heads-up

Needs the merged arm/device-connect#54 (get_rpc_source_device) — CI resolves it via UV_OVERRIDE; local runs need the updated device-connect-edge installed.

@kavya-chennoju kavya-chennoju merged commit eae4c1f into kavya-chennoju:feat/device-connect-on-main Jun 12, 2026
kavya-chennoju pushed a commit that referenced this pull request Jun 13, 2026
…ands-labs#372)

* mesh+tools: harden actuation, telemetry, and inference control surfaces

Tighten five security-relevant surfaces in the mesh networking layer and
the GR00T inference tool so that an untrusted agent context (prompt
injection, manipulated tool output) cannot drive physical actuation, leak
cross-peer telemetry, or reach the host filesystem.

ACL examples (Zenoh)
- Rewrite examples/mesh_acl_example.json5 to scope subscribe by ROLE: the
  robot role gets a narrow declare_subscriber rule (presence, broadcast,
  safety, own cmd/response) instead of a blanket `**`; the operator role
  keeps broad observe as an explicit, intentional grant.
- Add examples/mesh_acl_strict_per_peer.json5 for operators who need hard
  per-peer subscribe isolation (one rule+subject per robot CN), documenting
  the Zenoh 1.x literal-CN limitation.

policy_host (mesh/core.py)
- Add a defence-in-depth is_safe_policy_host() re-check inside _dispatch so
  an off-allowlist VLA policy host is rejected even if a caller reaches
  _dispatch without going through validate_command.

robot_mesh HITL (tools/robot_mesh.py)
- Make the human-in-the-loop interrupt set consumer-configurable via
  STRANDS_MESH_HITL_ACTIONS (all / none / explicit subset). The default now
  gates every physical-actuation action (emergency_stop, broadcast, tell,
  send, stop), not just the two fleet-wide ones. Bad env tokens fail loud.
- Pre-validate tell/send/broadcast command bodies BEFORE the interrupt so an
  operator never approves a payload the validator then rejects.
- Add a subscribe topic allowlist (STRANDS_MESH_SUBSCRIBE_ALLOW; defaults to
  low-impact shared classes) so the tool cannot subscribe to another peer's
  control/sensor streams; audit every inbox read.

gr00t_inference (tools/gr00t_inference.py)
- Drop volumes / image_name / container_command from the agent-facing tool
  signature. Container topology is now operator-config-driven: the image is
  resolved from STRANDS_GR00T_IMAGE and validated against an allowlist
  (extendable via STRANDS_GR00T_IMAGE_ALLOW); volumes default to the
  checkpoint + HF cache mounts only; the container command is fixed.
- Guard the private _start_container entry point too: reject off-allowlist
  images and any bind-mount of a protected host path (root fs, system dirs,
  credential dirs, docker socket).

Docs + tests
- Document the new STRANDS_* env vars in README Configuration.
- Add regression tests for every behaviour above (each fails on pre-fix
  code) and update existing tests to the new defaults.

* fix(gr00t): block children of protected host paths in volume guard

Address [MUST FIX] review on PR strands-labs#372: _check_volume_safety used exact-match
set membership, so bind-mounts of CHILDREN of protected dirs bypassed the
guard (/etc/shadow, /root/.ssh/id_rsa, /home/<u>/.aws/credentials,
/proc/1/environ, /var/run/docker.sock.bak). Switched to a prefix check
(reject the dir itself AND any child). Root '/' is matched exactly only so a
prefix test there doesn't reject every absolute path / legitimate operator
mount.

Also fix 2 CodeQL findings:
- robot_mesh: _NONE_OPT_OUT_WARNED reassigned via 'global' flagged as unused;
  switched to a one-element list flag (no module-global reassignment).
- test_gr00t_container_hardening: module imported with both 'import' and
  'import from'; dropped the from-import, use gi.* throughout.

Regression tests: parametrized child-path rejections + legit-mount
allow-list (fail on pre-fix exact-match code). Suites: 1376 passed, 2 skipped.

* review(tools): R1 -- allowlist container-hardening test in host-path sweep (CI fix)

tests/tools/test_gr00t_container_hardening.py contains /home/<user>/
paths as test DATA (asserting the production code REJECTS these paths).
The test_no_host_paths.py sweep correctly catches these literals, but
the file is a security test that must carry protected-path samples.
Add it to ALLOWED_FILES alongside the existing path-validation allowlist
entry which has the same rationale.

* fix(mesh): use real global flag for HITL opt-out once-warning

CodeQL py/unused-global-variable flagged _NONE_OPT_OUT_WARNED because the
one-element-list flag was only ever subscript-mutated, never re-bound, so
the module-level name binding looked write-only to the analysis. Replace
the list-hack with a real module bool plus a _warn_none_opt_out_once()
helper that owns the global rebind, making the use statically visible.

The list form provided no atomicity anyway; a benign log-once race under
concurrency is acceptable for a one-shot warning. _reset_interrupt_actions_cache
now also clears the flag so test isolation is unchanged.

* test(gr00t): load module via import_module to satisfy py/import-and-import-from

CodeQL py/import-and-import-from flagged the plain
'import strands_robots.tools.gr00t_inference as gi' because the same module
is imported via 'from ...gr00t_inference import gr00t_inference' elsewhere
in the package (a TYPE_CHECKING stub in strands_robots/__init__.py).

The obvious 'from strands_robots.tools import gr00t_inference as gi' does not
work: the tools package lazy __getattr__ resolves that name to the tool
function (a DecoratedFunctionTool), not the module, so it lacks subprocess /
_is_allowed_image / _start_container that these white-box tests need (incl.
patch.object(gi.subprocess, ...)). importlib.import_module returns the
canonical sys.modules entry without a plain dotted import statement, and is
the package's own module-loading idiom (both __init__.py files use it).

gi stays bound to the module object, so no other lines change.

* fix(robot_mesh): eliminate module-global warn-once flag to satisfy CodeQL

CodeQL py/unused-global-variable kept firing on _NONE_OPT_OUT_WARNED even
after the R2 module-bool rewrite (alerts 339/220 fired post-2a2faa9). The
analyzer does not recognize the conditional rebind inside the warn-once
guard as a use of the global.

Replace the module-global bool + manual guard with a functools.lru_cache
(maxsize=1) nullary function: first call logs, subsequent calls return the
memoized None without re-logging. No module-level mutable state remains, so
the alert has no name to flag. cache_clear() in _reset_interrupt_actions_cache
preserves test isolation. Behaviour (warn-once per process) is unchanged.

All 40 HITL/security/subscribe-allowlist tests pass; ruff/mypy clean.

* review(gr00t): R3 -- validate hf_local_dir against volume blocklist (addresses thread gr00t_inference.py:486)

* review(gr00t): R4 -- validate hf_local_dir at dispatch boundary to close download host-fs sink

hf_local_dir reaches two host-fs sinks but only the bind-mount sink was
guarded: _download_checkpoint writes the HF snapshot to it directly on the
host (no docker mediation), and lifecycle=full downloads before it starts the
container, so action=download_checkpoint / action=lifecycle bypassed the R3
_start_container guard entirely.

Hoist validation to the agent dispatch boundary via a shared
_check_hf_local_dir_safety helper (reuses the expand + prefix-match blocklist),
so every action rejects a prompt-injected path once at the untrusted-input
entry point. Keep defence-in-depth guards at the _download_checkpoint and
_start_container internal entry points for callers that bypass the tool.

Pin: tests/tools/test_gr00t_container_hardening.py::TestHfLocalDirDownloadSafety
(download/lifecycle dispatch + internal entry + helper; 16 cases, all fail on
pre-fix code -- host writes/docker reached).

* review(mesh): R5 -- apply subscribe allowlist to watch action (closes telemetry-exfiltration parallel channel)

* security(gr00t): remove agent-controlled build source repo/tag/dir

build_image / lifecycle clone a git repo and run `bash docker/build.sh`
from it. With repo_url / repo_tag / source_dir on the agent-facing tool
signature, a prompt-injected agent could clone an attacker tree and
execute its build script -- host RCE under the agent UID -- bypassing the
hardening that already removed image_name / volumes / container_command
for the same reason.

Mirror the established operator-config + allowlist pattern (the one used
for the container image):

- Drop repo_url / repo_tag / source_dir from the @tool signature (one-way
  door: done before 0.4.0 ships these as public params).
- Resolve the clone URL/tag from operator env (STRANDS_GR00T_REPO_URL /
  STRANDS_GR00T_REPO_TAG), validated at the dispatch boundary before any
  action branch forwards them; fail closed on a misconfigured env.
- URL allowlist is EXACT match (defaults to the canonical NVIDIA repo,
  with/without .git; extend via STRANDS_GR00T_REPO_URL_ALLOW). A trailing
  wildcard like the image allowlist would let `...Isaac-GR00T-evil` slip
  past, so URLs are matched literally.
- repo_tag is shape-validated (`^[A-Za-z0-9][A-Za-z0-9._/-]*$`) so it
  cannot be read by git as an option (e.g. --upload-pack) or carry shell
  metacharacters into the clone/checkout.
- Reject a leading `-` on the URL (git argument injection).
- source_dir is removed entirely; the clone destination is fixed to
  _isaac_gr00t_dir() (operator base dir), never caller-supplied.
- Defence in depth: _build_image re-asserts the URL allowlist + tag-shape
  guard, protecting the internal/operator/test entry point too.

Regression pins (tests/tools/test_gr00t_container_hardening.py):
- TestBuildSourceNotAgentControllable: signature drops the three params;
  passing any of them raises TypeError; off-allowlist URL and unsafe tag
  fail closed for build_image and lifecycle before any git/bash runs;
  internal _build_image entry is guarded; the default URL clones to the
  fixed dir; operators can extend the URL allowlist.
- test_no_agent_controlled_host_exec_or_path_params: structural invariant
  pinning that NO host-exec/mount-shaped param (repo_url, repo_tag,
  source_dir, image_name, volumes, container_command) is ever
  reintroduced to the agent surface -- closes the vulnerability class, not
  just this instance.

Updated tests/tools/test_gr00t_inference.py TestBuildImage and the
_lifecycle default kwargs to the new contract (allowlisted URL, no
source_dir, clone dest via patched _isaac_gr00t_dir).

* harden: close 3 pentest findings on PR strands-labs#372

PoC-driven follow-up to the mesh+inference hardening PR.

Finding A (HIGH): //etc bypassed _check_volume_safety / _check_hf_local_dir_safety
  os.path.normpath('//etc') == '//etc' on POSIX (preserves leading double slash)
  but the kernel collapses // to / at lookup, so docker -v //etc:/x mounts /etc.
  Fix: collapse runs of leading slashes in _normalize_host_path before normpath.

Finding B (LOW, defence in depth): _is_allowed_image accepted shell metacharacters
  in the tag (gr00t:;rm, gr00t:$(rm), etc.) because allowlist match was a plain
  startswith(prefix). Argv-mode subprocess prevents shell exec, but the value is
  still attacker-influenceable via STRANDS_GR00T_IMAGE env paste.
  Fix: gate on [A-Za-z0-9._:/@-]+ charset before allowlist comparison.

Finding C (HIGH): _ke_matches treated '**' literally, so the default subscribe
  allowlist (**/presence, **/health, **/safety/**) matched no real Zenoh keys
  like 'robot1/presence' or 'a/b/safety/event'. Operators would set
  STRANDS_MESH_SUBSCRIBE_ALLOW=** to make subscribes work, fully defeating
  the hardening.
  Fix: expand leading **/ (zero or more leading segments), trailing /** (prefix
  or deeper), and both-ends **/X/** (X anywhere).

Tests: 44 new regression tests in test_gr00t_pentest_regressions.py covering
  every PoC vector + legitimate-path/-image/-key positives. All 139 hardening
  + regression tests pass.

* style: ruff format test_gr00t_pentest_regressions.py

* harden(R7): reject wildcard targets in robot_mesh watch

R7 must-fix #2: 'target' is interpolated into 'strands/{target}/stream'
and matched against the subscribe allowlist. If an operator extended the
allowlist with a wildcard pattern (e.g. STRANDS_MESH_SUBSCRIBE_ALLOW=
'strands/*/stream' per the README example and the
test_watch_allowed_when_operator_extends_allowlist test), then the agent
could call watch(target='*') or watch(target='**') and the wildcard
target passed the allowlist match by literal equality / trailing-/**
branch -- reaching mesh.on_stream('*') and subscribing to every peer's
stream (the cross-peer telemetry-leak this surface exists to close).

Fix: validate 'target' as a literal peer id BEFORE interpolating into
the watch_key, using a regex pattern (alphanumerics + ._-, max 64 chars,
first char alphanumeric). Mirrors the _REPO_TAG_RE shape-validation
pattern in gr00t_inference.py for the same class of attack. The HITL
bypass exception is intentionally NOT applied to the shape check --
otherwise the operator could be tricked into approving a single-peer-
shaped reason string while the agent slipped a wildcard past.

Pin tests (tests/mesh/test_robot_mesh_subscribe_allowlist.py):
- test_watch_rejects_wildcard_targets_even_with_permissive_allowlist
  parametric across [*, **, */state, peer-a/state, ../etc, ''] each
  with STRANDS_MESH_SUBSCRIBE_ALLOW=strands/*/stream
- test_watch_rejects_wildcard_even_when_in_hitl_set_and_approved
  proves HITL approval can't legitimise wildcard
- test_watch_accepts_literal_peer_ids_with_extended_allowlist
  parametric across legitimate ids (peer-b, robot_1, robot.local, etc.)
- test_watch_rejects_overlong_peer_id (64-char boundary)
- test_watch_rejects_leading_special_char (-rm, .dot, _under, /abs)

R7 must-fix #1 (//etc bypass via os.path.normpath POSIX double-slash
preservation): already closed in commit 4cdbfa8 -- _normalize_host_path
collapses runs of leading slashes BEFORE normpath, and pin tests in
tests/tools/test_gr00t_pentest_regressions.py cover //etc, //etc/shadow,
///etc, etc. The R7 review thread on that surface was filed against
stale code (pre-4cdbfa8); no additional change required.

Tests: 155 passed (subscribe_allowlist + pentest_regressions +
container_hardening), full mesh suite 1355 passed / 2 skipped.
ruff check clean, ruff format clean.

Per AGENTS.md > Review Learnings (PR strands-labs#92) > 'LLM Input Safety':
allowlist enumerable values; reject shell metacharacters in paths.

* test: allowlist pentest-regression file in host-path hygiene guard

The R7 pentest regression suite (test_gr00t_pentest_regressions.py) feeds
protected host paths as attack input -- including //home/user/.aws/credentials
as a leading-double-slash bypass vector -- and asserts the production guard
REJECTS each one. The repo-hygiene sweep test_no_host_specific_absolute_paths
matched that literal against its /home/<name>/ pattern and failed CI, the same
class of false positive the suite already allowlists test_gr00t_container_
hardening.py for (security test data that proves rejection, not host coupling).

Add test_gr00t_pentest_regressions.py to ALLOWED_FILES with the same rationale.
No production code touched; the attack-input fixtures stay intact so the
double-slash and protected-path regressions keep their teeth.

Verified:
- tests/test_no_host_paths.py: 1 passed (was failing on head)
- tests/tools/test_gr00t_pentest_regressions.py: 44 passed
- ruff check + format clean

---------

Co-authored-by: strands-agent <217235299+strands-agent@users.noreply.github.com>
Co-authored-by: strands-robots <strands-robots@users.noreply.github.com>
Co-authored-by: strands-coder <agent@strands.local>
Co-authored-by: cagataycali <cagataycali@users.noreply.github.com>
Co-authored-by: DevDuck <devduck@cagatay.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants