[NemoClaw][All Platforms][Onboard] Onboard does not fail fast when all dashboard ports 18789-18799 are occupied by external processes #3953

@PrachiShevate-nv

Description

When all dashboard ports in the default 18789–18799 range are held by external processes, nemoclaw onboard continues instead of failing fast with a clear "all dashboard ports occupied" error. This behavior conflicts with the test design that expects a fatal preflight abort when the entire dashboard port range is unavailable.

Component area: Onboard (preflight, dashboard port allocation).

Environment

Host:           DGX Spark (NVIDIA GB10, FastOS), aarch64
nemoclaw:       v0.0.46
openshell:      0.0.39
Docker:         Docker CE; nvidia user in docker group; docker ps empty before test.

Steps to Reproduce

Preconditions:

nemoclaw list
# (no sandboxes)

docker ps
# (empty)

Repro:

  1. Start placeholder servers on all 11 ports:
    for p in $(seq 18789 18799); do
      python3 -m http.server "$p" &> /dev/null &
    done
  2. Verify ports:
    ss -ltn | grep -E ':187(89|9[0-9])\b'
    # LISTEN lines for 18789–18799, all external (python3)
  3. Run onboarding:
    nemoclaw onboard --name overflow-test
  4. Observe the preflight and subsequent steps.
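
The placeholder servers in step 1 can also be held from a single process, which makes cleanup deterministic. The following is a minimal sketch, not part of NemoClaw: `hold_ports` is a hypothetical helper name, and it assumes a plain listening TCP socket is enough to count as an occupied port (it does not serve HTTP the way `python3 -m http.server` does).

```python
import contextlib
import socket
from typing import Iterable, Iterator, List


@contextlib.contextmanager
def hold_ports(ports: Iterable[int], host: str = "0.0.0.0") -> Iterator[List[socket.socket]]:
    """Bind and listen on every port in `ports`; close all sockets on exit.

    Hypothetical repro helper. If any bind fails, the sockets opened so far
    are closed and the OSError propagates to the caller.
    """
    with contextlib.ExitStack() as stack:
        held: List[socket.socket] = []
        for port in ports:
            sock = stack.enter_context(
                socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            )
            sock.bind((host, port))
            sock.listen(5)
            held.append(sock)
        # The ports stay occupied for as long as the caller is inside the
        # `with` block; leaving it releases all of them at once.
        yield held
```

For the repro, wrapping `input("press Enter to release ports")` in `with hold_ports(range(18789, 18800)):` keeps all 11 ports occupied while `nemoclaw onboard --name overflow-test` runs in another terminal, and frees them as soon as Enter is pressed.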

Expected Result

  • Preflight should detect that all candidate dashboard ports (18789–18799) are occupied by non-OpenShell processes.
  • nemoclaw onboard should abort before the inference step with a fatal error similar to:
    All dashboard ports in range 18789-18799 are occupied:
      - 18789 → non-OpenShell host listener
      - 18790 → non-OpenShell host listener
      ...
      - 18799 → non-OpenShell host listener
    
  • Exit code should be non-zero; no gateway or sandbox should be created.
  • After freeing the ports and rerunning nemoclaw onboard --name overflow-test, onboarding should succeed and assign 18789 as the dashboard port.

Actual Result

All 11 ports are confirmed listening for external python3 processes:

ss -ltn | grep -E ':1879[0-9]'
# LISTEN 0 5  0.0.0.0:18790 ...
# ...
# LISTEN 0 5  0.0.0.0:18799 ...

nemoclaw onboard --name overflow-test:

  • Preflight passes.
  • Gateway reuse is reported on port 8080:
    ✓ Port 8080 already owned by healthy NemoClaw runtime (OpenShell gateway)
    
  • Onboarding proceeds to [3/8] Configuring inference (NIM) and shows the inference menu.
  • No "all dashboard ports occupied" error is shown.
  • The command does not exit with a non-zero code; it is still running at the inference selection step.

Impact

  • The documented/expected negative behavior "all dashboard ports in range occupied ⇒ fatal preflight abort with clear listing" is not implemented in NemoClaw v0.0.46.
  • QA cannot validate how NemoClaw behaves when the entire dashboard port range is exhausted; instead, onboarding continues until a later bind failure (if any) or silently records no dashboard URL.
  • Users with many external processes on 18789–18799 may see onboarding proceed without a dashboard, with no clear guidance on port conflicts.

Suggested Fix

  • At preflight or early in onboard, scan the dashboard port range (default 18789–18799, or the configured range) for listeners.
  • If all ports are occupied by non-OpenShell processes, fail fast with a clear fatal error that lists each port and process and suggests either:
    • Freeing some ports; or
    • Setting NEMOCLAW_DASHBOARD_PORT / --control-ui-port to a port outside the occupied range.
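
The scan-and-abort logic suggested above could look roughly like the sketch below. It is an illustration under stated assumptions, not NemoClaw's implementation: `pick_dashboard_port`, `port_is_free`, and `DashboardPortsExhausted` are hypothetical names, the probe is a simple bind attempt, and the sketch does not identify which process owns each port (so it cannot distinguish OpenShell listeners from external ones).

```python
import socket
from typing import Callable, Iterable

# 18789-18799 inclusive, the default range named in this issue.
DEFAULT_DASHBOARD_PORTS = range(18789, 18800)


class DashboardPortsExhausted(RuntimeError):
    """Raised when every candidate dashboard port is already in use."""


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_dashboard_port(
    ports: Iterable[int] = DEFAULT_DASHBOARD_PORTS,
    is_free: Callable[[int], bool] = port_is_free,
) -> int:
    """Return the first free port, or raise listing every occupied port."""
    candidates = list(ports)
    if not candidates:
        raise ValueError("no candidate dashboard ports given")
    for port in candidates:
        if is_free(port):
            return port
    # Every candidate is taken: fail fast with a per-port listing.
    # Owner detection is out of scope here, so ports are reported as "in use".
    listing = "\n".join(f"  - {p} -> in use" for p in candidates)
    raise DashboardPortsExhausted(
        f"All dashboard ports in range {candidates[0]}-{candidates[-1]} "
        f"are occupied:\n{listing}"
    )
```

A preflight step could call `pick_dashboard_port()` before creating any gateway or sandbox, catch `DashboardPortsExhausted`, print the message, and exit with a non-zero code. The `is_free` parameter is injectable so the all-occupied path can be unit tested without binding real ports.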

NVB#6197448

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: networking (DNS, proxy, TLS, ports, host aliases, or connectivity)
  • v0.0.51 (Release target)