fix(device-connect): security hardening for the Device Connect integration by cagataycali · Pull Request #2 · kavya-chennoju/robots

cagataycali · 2026-06-11T21:10:53Z

Defence-in-depth hardening for the Device Connect integration across the agent dispatch path, the device drivers, and the Reachy transport. All changes preserve existing behaviour on the trusted-path defaults.

Depends on arm/device-connect#54 (get_rpc_source_device() hook), which the driver-side caller authorization uses.

Findings addressed

#	Finding	Fix
01	Broadcast re-parses raw caller command (TOCTOU — executed action could differ from approved one)	Dispatch the validated, operator-approved command through Device Connect; no raw `json.loads` in the broadcast branch
02	`policy_provider` SSRF — inference could be pointed at an arbitrary endpoint	Validate `policy_provider` against the existing allowlist in both `robot_driver.execute` and `sim_driver.execute` before start
03	Device-native `rpc` not HITL-gated	Route `rpc` through the human-in-the-loop approval gate, surfacing the function name
04	Path traversal / query injection in Reachy `playMove` `move_name`	Strict charset validation before interpolating into the daemon REST path
05	Unauthenticated Reachy daemon (REST + WebSocket)	Optional bearer token (`REACHY_DAEMON_TOKEN`); loud warning when the link is unauthenticated
06	No RPC authz + spoofable e-stop	Per-call caller ACL on state-mutating RPCs (execute/stop/step/reset) + `emergencyStop` source allowlist. Env-driven (`DEVICE_CONNECT_RPC_ALLOW` / `DEVICE_CONNECT_ESTOP_ALLOW`); permissive-with-warning when unset, fail-closed when set
07	Insecure-by-default transport	`allow_insecure` defaults to False (opt-in via arg or `DEVICE_CONNECT_ALLOW_INSECURE`); warning logged when active; removed the agent-side `setdefault` that forced insecure mode process-wide. GUIDE.md updated to document secure-by-default + explicit opt-in

Changes

tools/robot_mesh.py: validated-command plumbing for broadcast; rpc HITL gate
device_connect/robot_driver.py, sim_driver.py: policy_provider allowlist + caller ACL + e-stop source check
device_connect/reachy_mini_driver.py, reachy_transport.py: move_name validation + bearer-token auth
device_connect/__init__.py: secure-by-default transport; removed forced insecure setdefault
device_connect/_authz.py: new ACL helper (env-driven, fail-closed)
device_connect/GUIDE.md: secure-by-default docs + insecure opt-in for local D2D
.github/workflows/test-lint.yml: pin device-connect-edge/agent-tools to source via UV_OVERRIDE when a PR touches the DC integration
tests/test_device_connect_hardening.py: covers all of the above

Tests

```
1142 passed — drivers + hardening + all_robots + mesh security + mesh tool (CI collection order)
54 passed — device-connect-edge framework (incl. new get_rpc_source_device hook tests)
```

Note for @kavya-chennoju

There's a pre-existing test-isolation fragility in tests/test_device_connect_all_robots.py (mocks device_connect_edge in sys.modules but not device_connect_edge.messaging; fails when it runs before test_device_connect_drivers.py). Verified by stashing all of this PR's work — pristine branch fails the identical 6 tests. This PR introduces zero new failures. Flagging it as a good follow-up cleanup, but it's out of scope here.

…ation Harden the Device Connect integration across the agent dispatch path, the device drivers, and the Reachy transport. All changes are defence-in-depth and preserve existing behaviour on the trusted-path defaults. - robot_mesh broadcast: dispatch the validated, operator-approved command through Device Connect instead of re-parsing the raw caller string, so the executed action cannot differ from the approved one. - robot_mesh rpc: route device-native rpc through the human-in-the-loop approval gate (surfacing the function name) so device-native functions are not invoked without explicit approval. - drivers: validate policy_provider against the existing allowlist in both robot and sim execute() before starting a policy, preventing inference from being pointed at an arbitrary endpoint. - drivers: add per-call caller authorization on state-mutating RPCs (execute/stop/step/reset) and validate the source of emergencyStop against an allowlist. Env-driven (DEVICE_CONNECT_RPC_ALLOW / DEVICE_CONNECT_ESTOP_ALLOW); permissive with a warning when unset, fail-closed when set. Uses get_rpc_source_device() from device-connect-edge. - reachy: validate playMove move_name against a strict charset before interpolating it into the daemon REST path (prevents path traversal / query injection). - reachy: support an optional bearer token (REACHY_DAEMON_TOKEN) on the daemon WebSocket and REST interfaces and warn loudly when the link is unauthenticated. - transport: make Device Connect secure by default. allow_insecure now defaults to False and must be opted into explicitly (arg or DEVICE_CONNECT_ALLOW_INSECURE); a warning is logged whenever insecure mode is active. Removed the agent-side setdefault that forced insecure mode process-wide. - GUIDE.md: document secure-by-default and the explicit insecure opt-in for local D2D trials. - CI: when a PR touches the Device Connect integration, pin device-connect-edge / agent-tools to source via UV_OVERRIDE so the matching framework version is exercised. - tests: add tests/test_device_connect_hardening.py covering all of the above.

kavya-chennoju · 2026-06-12T07:02:21Z

@cagataycali This is awesome, thanks for the help.

follow-up fix for the caller-authorization work, ready to pull into this PR.

What was wrong

A live E2E (real Robot('so100') sim driven over Zenoh via robot_mesh) showed the per-call caller ACL is all-or-nothing on the agent path. device-connect-agent-tools clients are anonymous, so the device always sees caller=None. With DEVICE_CONNECT_RPC_ALLOW set, every agent call is denied and there's no reachable allow-path:

# agent → so100 (DEVICE_CONNECT_RPC_ALLOW=trusted-controller)
{"status":"error","reason":"caller not authorized for 'execute'","caller":"unknown"}
robot log: Rejected unauthorized Device Connect RPC execute from caller=None

The unit tests passed source_device= straight into the driver, masking this — that state isn't reachable through the documented agent flow.

The fix (3 layers)

Layer 1 — identity propagation (robot_mesh.py): stamp _dc_meta.source_device into the DC command envelope from STRANDS_ROBOT_MESH_AGENT_ID / DEVICE_CONNECT_CLIENT_ID. Anonymous callers stay fail-closed.
Layer 2 — honest authz (_authz.py): document that the id is only a hard boundary under authenticated transport; log a one-time advisory when an allowlist is enforced under DEVICE_CONNECT_ALLOW_INSECURE (self-asserted id).
Layer 3 — docs + tests: extract resolve_allow_insecure() + fix the stale "defaults to True" docstring; fix the contradictory DEVICE_CONNECT_ALLOW_INSECURE=true in the GUIDE's mTLS section; replace the tautological test_allow_insecure_defaults_false; add coverage for the reachable anonymous-deny, the insecure advisory, and _dc_meta propagation. 21 → 29 tests.

Re-run on the live sim after the fix: anonymous → denied; STRANDS_ROBOT_MESH_AGENT_ID=trusted-controller → success.

How to pull it in

The commit (f76d21d) is based directly on this PR's current head (fd66308), so it's a clean fast-forward. From your local checkout on this branch:

git fetch https://github.com/kavya-chennoju/robots feat/device-connect-security-hardening
git merge --ff-only FETCH_HEAD
git push origin feat/device-connect-security-hardening

(or git cherry-pick f76d21d if you'd rather take just the one commit). It's also open as #3 against feat/device-connect-on-main if you'd prefer to review/land it there.

Heads-up

Needs the merged arm/device-connect#54 (get_rpc_source_device) — CI resolves it via UV_OVERRIDE; local runs need the updated device-connect-edge installed.

@tool

…ands-labs#372) * mesh+tools: harden actuation, telemetry, and inference control surfaces Tighten five security-relevant surfaces in the mesh networking layer and the GR00T inference tool so that an untrusted agent context (prompt injection, manipulated tool output) cannot drive physical actuation, leak cross-peer telemetry, or reach the host filesystem. ACL examples (Zenoh) - Rewrite examples/mesh_acl_example.json5 to scope subscribe by ROLE: the robot role gets a narrow declare_subscriber rule (presence, broadcast, safety, own cmd/response) instead of a blanket `**`; the operator role keeps broad observe as an explicit, intentional grant. - Add examples/mesh_acl_strict_per_peer.json5 for operators who need hard per-peer subscribe isolation (one rule+subject per robot CN), documenting the Zenoh 1.x literal-CN limitation. policy_host (mesh/core.py) - Add a defence-in-depth is_safe_policy_host() re-check inside _dispatch so an off-allowlist VLA policy host is rejected even if a caller reaches _dispatch without going through validate_command. robot_mesh HITL (tools/robot_mesh.py) - Make the human-in-the-loop interrupt set consumer-configurable via STRANDS_MESH_HITL_ACTIONS (all / none / explicit subset). The default now gates every physical-actuation action (emergency_stop, broadcast, tell, send, stop), not just the two fleet-wide ones. Bad env tokens fail loud. - Pre-validate tell/send/broadcast command bodies BEFORE the interrupt so an operator never approves a payload the validator then rejects. - Add a subscribe topic allowlist (STRANDS_MESH_SUBSCRIBE_ALLOW; defaults to low-impact shared classes) so the tool cannot subscribe to another peer's control/sensor streams; audit every inbox read. gr00t_inference (tools/gr00t_inference.py) - Drop volumes / image_name / container_command from the agent-facing tool signature. Container topology is now operator-config-driven: the image is resolved from STRANDS_GR00T_IMAGE and validated against an allowlist (extendable via STRANDS_GR00T_IMAGE_ALLOW); volumes default to the checkpoint + HF cache mounts only; the container command is fixed. - Guard the private _start_container entry point too: reject off-allowlist images and any bind-mount of a protected host path (root fs, system dirs, credential dirs, docker socket). Docs + tests - Document the new STRANDS_* env vars in README Configuration. - Add regression tests for every behaviour above (each fails on pre-fix code) and update existing tests to the new defaults. * fix(gr00t): block children of protected host paths in volume guard Address [MUST FIX] review on PR strands-labs#372: _check_volume_safety used exact-match set membership, so bind-mounts of CHILDREN of protected dirs bypassed the guard (/etc/shadow, /root/.ssh/id_rsa, /home/<u>/.aws/credentials, /proc/1/environ, /var/run/docker.sock.bak). Switched to a prefix check (reject the dir itself AND any child). Root '/' is matched exactly only so a prefix test there doesn't reject every absolute path / legitimate operator mount. Also fix 2 CodeQL findings: - robot_mesh: _NONE_OPT_OUT_WARNED reassigned via 'global' flagged as unused; switched to a one-element list flag (no module-global reassignment). - test_gr00t_container_hardening: module imported with both 'import' and 'import from'; dropped the from-import, use gi.* throughout. Regression tests: parametrized child-path rejections + legit-mount allow-list (fail on pre-fix exact-match code). Suites: 1376 passed, 2 skipped. * review(tools): R1 -- allowlist container-hardening test in host-path sweep (CI fix) tests/tools/test_gr00t_container_hardening.py contains /home/<user>/ paths as test DATA (asserting the production code REJECTS these paths). The test_no_host_paths.py sweep correctly catches these literals, but the file is a security test that must carry protected-path samples. Add it to ALLOWED_FILES alongside the existing path-validation allowlist entry which has the same rationale. * fix(mesh): use real global flag for HITL opt-out once-warning CodeQL py/unused-global-variable flagged _NONE_OPT_OUT_WARNED because the one-element-list flag was only ever subscript-mutated, never re-bound, so the module-level name binding looked write-only to the analysis. Replace the list-hack with a real module bool plus a _warn_none_opt_out_once() helper that owns the global rebind, making the use statically visible. The list form provided no atomicity anyway; a benign log-once race under concurrency is acceptable for a one-shot warning. _reset_interrupt_actions_cache now also clears the flag so test isolation is unchanged. * test(gr00t): load module via import_module to satisfy py/import-and-import-from CodeQL py/import-and-import-from flagged the plain 'import strands_robots.tools.gr00t_inference as gi' because the same module is imported via 'from ...gr00t_inference import gr00t_inference' elsewhere in the package (a TYPE_CHECKING stub in strands_robots/__init__.py). The obvious 'from strands_robots.tools import gr00t_inference as gi' does not work: the tools package lazy __getattr__ resolves that name to the tool function (a DecoratedFunctionTool), not the module, so it lacks subprocess / _is_allowed_image / _start_container that these white-box tests need (incl. patch.object(gi.subprocess, ...)). importlib.import_module returns the canonical sys.modules entry without a plain dotted import statement, and is the package's own module-loading idiom (both __init__.py files use it). gi stays bound to the module object, so no other lines change. * fix(robot_mesh): eliminate module-global warn-once flag to satisfy CodeQL CodeQL py/unused-global-variable kept firing on _NONE_OPT_OUT_WARNED even after the R2 module-bool rewrite (alerts 339/220 fired post-2a2faa9). The analyzer does not recognize the conditional rebind inside the warn-once guard as a use of the global. Replace the module-global bool + manual guard with a functools.lru_cache (maxsize=1) nullary function: first call logs, subsequent calls return the memoized None without re-logging. No module-level mutable state remains, so the alert has no name to flag. cache_clear() in _reset_interrupt_actions_cache preserves test isolation. Behaviour (warn-once per process) is unchanged. All 40 HITL/security/subscribe-allowlist tests pass; ruff/mypy clean. * review(gr00t): R3 -- validate hf_local_dir against volume blocklist (addresses thread gr00t_inference.py:486) * review(gr00t): R4 -- validate hf_local_dir at dispatch boundary to close download host-fs sink hf_local_dir reaches two host-fs sinks but only the bind-mount sink was guarded: _download_checkpoint writes the HF snapshot to it directly on the host (no docker mediation), and lifecycle=full downloads before it starts the container, so action=download_checkpoint / action=lifecycle bypassed the R3 _start_container guard entirely. Hoist validation to the agent dispatch boundary via a shared _check_hf_local_dir_safety helper (reuses the expand + prefix-match blocklist), so every action rejects a prompt-injected path once at the untrusted-input entry point. Keep defence-in-depth guards at the _download_checkpoint and _start_container internal entry points for callers that bypass the tool. Pin: tests/tools/test_gr00t_container_hardening.py::TestHfLocalDirDownloadSafety (download/lifecycle dispatch + internal entry + helper; 16 cases, all fail on pre-fix code -- host writes/docker reached). * review(mesh): R5 -- apply subscribe allowlist to watch action (closes telemetry-exfiltration parallel channel) * security(gr00t): remove agent-controlled build source repo/tag/dir build_image / lifecycle clone a git repo and run `bash docker/build.sh` from it. With repo_url / repo_tag / source_dir on the agent-facing tool signature, a prompt-injected agent could clone an attacker tree and execute its build script -- host RCE under the agent UID -- bypassing the hardening that already removed image_name / volumes / container_command for the same reason. Mirror the established operator-config + allowlist pattern (the one used for the container image): - Drop repo_url / repo_tag / source_dir from the @tool signature (one-way door: done before 0.4.0 ships these as public params). - Resolve the clone URL/tag from operator env (STRANDS_GR00T_REPO_URL / STRANDS_GR00T_REPO_TAG), validated at the dispatch boundary before any action branch forwards them; fail closed on a misconfigured env. - URL allowlist is EXACT match (defaults to the canonical NVIDIA repo, with/without .git; extend via STRANDS_GR00T_REPO_URL_ALLOW). A trailing wildcard like the image allowlist would let `...Isaac-GR00T-evil` slip past, so URLs are matched literally. - repo_tag is shape-validated (`^[A-Za-z0-9][A-Za-z0-9._/-]*$`) so it cannot be read by git as an option (e.g. --upload-pack) or carry shell metacharacters into the clone/checkout. - Reject a leading `-` on the URL (git argument injection). - source_dir is removed entirely; the clone destination is fixed to _isaac_gr00t_dir() (operator base dir), never caller-supplied. - Defence in depth: _build_image re-asserts the URL allowlist + tag-shape guard, protecting the internal/operator/test entry point too. Regression pins (tests/tools/test_gr00t_container_hardening.py): - TestBuildSourceNotAgentControllable: signature drops the three params; passing any of them raises TypeError; off-allowlist URL and unsafe tag fail closed for build_image and lifecycle before any git/bash runs; internal _build_image entry is guarded; the default URL clones to the fixed dir; operators can extend the URL allowlist. - test_no_agent_controlled_host_exec_or_path_params: structural invariant pinning that NO host-exec/mount-shaped param (repo_url, repo_tag, source_dir, image_name, volumes, container_command) is ever reintroduced to the agent surface -- closes the vulnerability class, not just this instance. Updated tests/tools/test_gr00t_inference.py TestBuildImage and the _lifecycle default kwargs to the new contract (allowlisted URL, no source_dir, clone dest via patched _isaac_gr00t_dir). * harden: close 3 pentest findings on PR strands-labs#372 PoC-driven follow-up to the mesh+inference hardening PR. Finding A (HIGH): //etc bypassed _check_volume_safety / _check_hf_local_dir_safety os.path.normpath('//etc') == '//etc' on POSIX (preserves leading double slash) but the kernel collapses // to / at lookup, so docker -v //etc:/x mounts /etc. Fix: collapse runs of leading slashes in _normalize_host_path before normpath. Finding B (LOW, defence in depth): _is_allowed_image accepted shell metacharacters in the tag (gr00t:;rm, gr00t:$(rm), etc.) because allowlist match was a plain startswith(prefix). Argv-mode subprocess prevents shell exec, but the value is still attacker-influenceable via STRANDS_GR00T_IMAGE env paste. Fix: gate on [A-Za-z0-9._:/@-]+ charset before allowlist comparison. Finding C (HIGH): _ke_matches treated '**' literally, so the default subscribe allowlist (**/presence, **/health, **/safety/**) matched no real Zenoh keys like 'robot1/presence' or 'a/b/safety/event'. Operators would set STRANDS_MESH_SUBSCRIBE_ALLOW=** to make subscribes work, fully defeating the hardening. Fix: expand leading **/ (zero or more leading segments), trailing /** (prefix or deeper), and both-ends **/X/** (X anywhere). Tests: 44 new regression tests in test_gr00t_pentest_regressions.py covering every PoC vector + legitimate-path/-image/-key positives. All 139 hardening + regression tests pass. * style: ruff format test_gr00t_pentest_regressions.py * harden(R7): reject wildcard targets in robot_mesh watch R7 must-fix #2: 'target' is interpolated into 'strands/{target}/stream' and matched against the subscribe allowlist. If an operator extended the allowlist with a wildcard pattern (e.g. STRANDS_MESH_SUBSCRIBE_ALLOW= 'strands/*/stream' per the README example and the test_watch_allowed_when_operator_extends_allowlist test), then the agent could call watch(target='*') or watch(target='**') and the wildcard target passed the allowlist match by literal equality / trailing-/** branch -- reaching mesh.on_stream('*') and subscribing to every peer's stream (the cross-peer telemetry-leak this surface exists to close). Fix: validate 'target' as a literal peer id BEFORE interpolating into the watch_key, using a regex pattern (alphanumerics + ._-, max 64 chars, first char alphanumeric). Mirrors the _REPO_TAG_RE shape-validation pattern in gr00t_inference.py for the same class of attack. The HITL bypass exception is intentionally NOT applied to the shape check -- otherwise the operator could be tricked into approving a single-peer- shaped reason string while the agent slipped a wildcard past. Pin tests (tests/mesh/test_robot_mesh_subscribe_allowlist.py): - test_watch_rejects_wildcard_targets_even_with_permissive_allowlist parametric across [*, **, */state, peer-a/state, ../etc, ''] each with STRANDS_MESH_SUBSCRIBE_ALLOW=strands/*/stream - test_watch_rejects_wildcard_even_when_in_hitl_set_and_approved proves HITL approval can't legitimise wildcard - test_watch_accepts_literal_peer_ids_with_extended_allowlist parametric across legitimate ids (peer-b, robot_1, robot.local, etc.) - test_watch_rejects_overlong_peer_id (64-char boundary) - test_watch_rejects_leading_special_char (-rm, .dot, _under, /abs) R7 must-fix #1 (//etc bypass via os.path.normpath POSIX double-slash preservation): already closed in commit 4cdbfa8 -- _normalize_host_path collapses runs of leading slashes BEFORE normpath, and pin tests in tests/tools/test_gr00t_pentest_regressions.py cover //etc, //etc/shadow, ///etc, etc. The R7 review thread on that surface was filed against stale code (pre-4cdbfa8); no additional change required. Tests: 155 passed (subscribe_allowlist + pentest_regressions + container_hardening), full mesh suite 1355 passed / 2 skipped. ruff check clean, ruff format clean. Per AGENTS.md > Review Learnings (PR strands-labs#92) > 'LLM Input Safety': allowlist enumerable values; reject shell metacharacters in paths. * test: allowlist pentest-regression file in host-path hygiene guard The R7 pentest regression suite (test_gr00t_pentest_regressions.py) feeds protected host paths as attack input -- including //home/user/.aws/credentials as a leading-double-slash bypass vector -- and asserts the production guard REJECTS each one. The repo-hygiene sweep test_no_host_specific_absolute_paths matched that literal against its /home/<name>/ pattern and failed CI, the same class of false positive the suite already allowlists test_gr00t_container_ hardening.py for (security test data that proves rejection, not host coupling). Add test_gr00t_pentest_regressions.py to ALLOWED_FILES with the same rationale. No production code touched; the attack-input fixtures stay intact so the double-slash and protected-path regressions keep their teeth. Verified: - tests/test_no_host_paths.py: 1 passed (was failing on head) - tests/tools/test_gr00t_pentest_regressions.py: 44 passed - ruff check + format clean --------- Co-authored-by: strands-agent <217235299+strands-agent@users.noreply.github.com> Co-authored-by: strands-robots <strands-robots@users.noreply.github.com> Co-authored-by: strands-coder <agent@strands.local> Co-authored-by: cagataycali <cagataycali@users.noreply.github.com> Co-authored-by: DevDuck <devduck@cagatay.local>

kavya-chennoju merged commit eae4c1f into kavya-chennoju:feat/device-connect-on-main Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(device-connect): security hardening for the Device Connect integration#2

fix(device-connect): security hardening for the Device Connect integration#2
kavya-chennoju merged 1 commit into
kavya-chennoju:feat/device-connect-on-mainfrom
cagataycali:feat/device-connect-security-hardening

cagataycali commented Jun 11, 2026

Uh oh!

kavya-chennoju commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cagataycali commented Jun 11, 2026

Findings addressed

Changes

Tests

Note for @kavya-chennoju

Uh oh!

kavya-chennoju commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What was wrong

The fix (3 layers)

How to pull it in

Heads-up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kavya-chennoju commented Jun 12, 2026 •

edited

Loading