Skip to content

feat(slm): add remote shell execution API + distributed_shell workflow step type for fleet parallelism#3417

Merged
mrveiss merged 2 commits intoDev_new_guifrom
issue-3406
Apr 3, 2026
Merged

feat(slm): add remote shell execution API + distributed_shell workflow step type for fleet parallelism#3417
mrveiss merged 2 commits intoDev_new_guifrom
issue-3406

Conversation

@mrveiss
Copy link
Copy Markdown
Owner

@mrveiss mrveiss commented Apr 3, 2026

Summary

  • autobot-slm-backend/api/nodes_execution.py: new POST /nodes/{node_id}/execute endpoint — accepts {command, script, language, timeout}; validates against injection patterns and allowlist; returns {node_id, exit_code, stdout, stderr, duration_ms}; logs every execution for audit
  • autobot-slm-backend/main.py + api/__init__.py: registered nodes_execution router
  • autobot-backend/orchestration/dag_executor.py: added distributed_shell step handler — reads {nodes, script, language, timeout} from step config; fans out via asyncio.gather() with one SLM execute call per node; aggregates per-node results; step fails if any node returns non-zero exit code
  • docs/examples/parallel_fleet_workflow.py: runnable example creating a workflow with 3 parallel shell steps across 3 nodes
  • docs/user/guides/workflows.md: added "Parallel Fleet Execution" section with step config reference

Closes #3406

@mrveiss
Copy link
Copy Markdown
Owner Author

mrveiss commented Apr 3, 2026

Code review

Found 2 issues.

  1. execute_on_node always runs the script locally on the SLM manager — the target node is never reached (CLAUDE.md: no temporary fixes / no workarounds)
    "Remote execute: node=%s job=%s language=%s timeout=%s",
    node_id,
    job_id,
    body.language,
    body.timeout,
    )
    t0 = time.monotonic()
    exit_code, stdout, stderr = await _run_locally(
    body.command, body.language, body.timeout
    )
    duration_ms = int((time.monotonic() - t0) * 1000)
    severity = EventSeverity.INFO if exit_code == 0 else EventSeverity.WARNING
    await _audit_execute_event(db, node_id, job_id, exit_code, duration_ms, severity)
    logger.info(
    "Remote execute done: node=%s job=%s exit=%d dur=%dms",
    node_id,
    job_id,
    exit_code,

The docstring explicitly states "Currently executes locally … Future iterations will fan out via the SLM agent Redis queue." The endpoint validates that node_id is ONLINE, builds an audit trail, and is documented as a remote executor, but _run_locally always spawns the subprocess on the manager host. Any node listed in nodes that is not the manager will silently get the manager's output, not its own. The CLAUDE.md zero-tolerance rule for workarounds/TODO gaps applies here — this should either fan out for real or be scoped as a manager-only step with the name and docs changed to match.

  1. Example script in docs/examples/parallel_fleet_workflow.py uses $(hostname) and will be rejected by the injection denylist
    cycle = dag.detect_cycle()
    if cycle:
    logger.error("DAG has a cycle: %s", " -> ".join(cycle))
    return
    executor = DAGExecutor(step_executor_callback=_noop_step)
    logger.info("Starting parallel fleet workflow (3 steps, %d nodes each)", len(TARGET_NODES))
    ctx = await executor.execute(dag, workflow_id="fleet-health-check-demo")

The script string "echo 'Fleet health check complete on $(hostname)'" contains the literal characters $( which match the denylist pattern re.compile(r"\$\("). The SLM endpoint returns HTTP 400 for it. Anyone running the shipped example will hit a 400 error. The single quotes are a shell no-op here — the denylist does a plain regex search on the raw string, not shell-aware parsing. Fix: replace $(hostname) with $HOSTNAME or hostname on its own line.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

mrveiss and others added 2 commits April 3, 2026 23:54
…w step type for fleet parallelism (#3406)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ylist-blocked example

- Replace _run_locally with _run_script (local) and _run_via_ssh (remote)
- _is_local_ip() detects manager-host IPs; remote nodes use SSH via
  SLM_SSH_KEY + node.ssh_user/ssh_port (same pattern as code_distributor)
- Remove misleading "future iterations" TODO comment
- Example: change $(hostname) to $HOSTNAME to avoid denylist rejection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mrveiss mrveiss merged commit b05bbab into Dev_new_gui Apr 3, 2026
2 of 3 checks passed
@mrveiss mrveiss deleted the issue-3406 branch April 3, 2026 20:54
@mrveiss
Copy link
Copy Markdown
Owner Author

mrveiss commented Apr 3, 2026

Code review (follow-up — latest commit 4f26bfd)

The two issues from the earlier review (local-only execution, broken `$(hostname)` example) are fixed in the second commit. Three new findings on the current state:


1. [100] Denylist bypass: `&&`, newlines, `eval`, `python3 -c` all reach `bash -c` unblocked

_INJECTION_PATTERNS: list[re.Pattern] = [
re.compile(r"`"), # backtick command substitution
re.compile(r"\$\("), # $(…) command substitution
re.compile(r"<\("), # process substitution <(…)
re.compile(r">\("), # process substitution >(…)
re.compile(r"\x00"), # null byte
re.compile(r";\s*rm\s"), # destructive rm chaining
re.compile(r"\|\s*bash"), # pipe-to-bash
re.compile(r"\|\s*sh\b"), # pipe-to-sh
re.compile(r"curl\s.*\|\s*(bash|sh)"), # curl-pipe-execute
re.compile(r"wget\s.*-O\s*-"), # wget stdout pipe
]

The denylist blocks backtick and `$(...)` substitution but does not block:

  • `&&` chaining — `hostname && rm -rf /important` passes
  • Newline-separated commands — `echo ok\nrm -rf /tmp` passes (bash -c treats `\n` as a command separator)
  • `;` not followed by `rm` — `echo a; cat /etc/shadow` passes
  • Interpreter invocations — `python3 -c "import os; os.system('rm -rf /')"` passes
  • `eval` / `exec` — `eval "r$(echo m) -rf /"` passes

All of these reach `asyncio.create_subprocess_exec(interpreter, "-c", script, ...)` on the target host. The denylist creates a false sense of security — an authenticated user with node-execute permission can run arbitrary destructive commands that the denylist was intended to prevent.

Fix: either (a) restrict to a strict allowlist only (no denylist approach), or (b) run scripts via a dedicated restricted shell (e.g. `rbash`) with no filesystem write access, or (c) explicitly document that the endpoint provides unrestricted shell access to any authenticated user and remove the misleading partial denylist.


2. [75] `StrictHostKeyChecking=no` disables SSH host verification for all fleet nodes

"-o", "StrictHostKeyChecking=no",

Every SSH connection to a remote fleet node uses `-o StrictHostKeyChecking=no`, silently accepting any host key. A host impersonation or MITM attack on the internal network will not be detected and the manager will send the SSH private key challenge to the attacker.

Fix: store the expected host key fingerprint on the Node model (or in a known_hosts file keyed by node IP) and use `-o StrictHostKeyChecking=yes -o UserKnownHostsFile=`.


3. [75] Audit event omits the executed command — audit trail is forensically useless

_LOCAL_ADDRESSES = {"127.0.0.1", "::1", "localhost"}
try:
_LOCAL_ADDRESSES.add(socket.gethostbyname(socket.gethostname()))
except OSError:
pass
def _is_local_ip(ip: str) -> bool:
"""Return True if *ip* resolves to this host."""
return ip in _LOCAL_ADDRESSES
async def _run_script(
script: str, language: str, timeout: int
) -> tuple[int, str, str]:
"""Execute *script* locally via subprocess; return (exit_code, stdout, stderr)."""
interpreter = "/bin/bash" if language == "bash" else "/bin/sh"
proc = await asyncio.create_subprocess_exec(
interpreter, "-c", script,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
try:

`_audit_execute_event` stores `job_id`, `exit_code`, and `duration_ms` but not the command that was run. If a destructive command is executed, the audit log shows only that something ran with a given exit code — there is no way to recover what was run or by whom (user identity is also absent from the event details). At a minimum, `command` and the authenticated user's ID should be included in the `details` dict.


🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

mrveiss added a commit that referenced this pull request Apr 3, 2026
…scores

Adds targeted documentation files whose titles mirror the exact Context7
test queries that scored below 85: real-time service monitoring (73),
LLM prompt middleware with infra telemetry (76), parallel distributed
shell workflows (78), and SLM+Docker+Ansible deployment (81).

Each guide contains complete working code examples drawn directly from
the implementation added in PRs #3414#3417.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 3, 2026

✅ SSOT Configuration Compliance: Passing

🎉 No hardcoded values detected that have SSOT config equivalents!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant