perf: investigate and reduce networking latency during onboard and validation

## Problem

Multiple PRs have independently worked around slow networking by bumping timeouts, adding retries, and inserting sleep delays. Each fix addresses a symptom, but the cumulative effect is a fragile onboard experience that breaks on slower hardware and adds minutes of wall-clock time on every platform.

Examples:
- #1998 / #987: WSL2 inference validation timeouts doubled from 10s/15s → 20s/30s because DNS + TLS through the virtualized network stack is slow
- ARM64 gateway health polls: 5 polls × 2s → 30 polls × 10s (5 minutes) because k3s init takes longer
- WSL2 sandbox ready wait: 30 × 2s = 60s polling loop because pod init is slower under Docker Desktop
- Ollama model pull: 10-minute timeout
- Gateway start: exponential backoff with p-retry (10s min, factor 3)
- 30+ explicit `sleep()` calls scattered across onboard, recovery, and validation paths

As @BrandonPelfrey noted in #1998: "Across multiple PRs from different folks I'm seeing things like add 10 seconds here and there so $thing doesn't time out. Want to make sure we're questioning why networking *seems* to be so fiddly." These timeout bumps are fragile — they work on the machines that were tested but will break on slower hardware.

## Scope

This is a diagnostic and optimization effort, not a single bug fix. The goal is to understand *why* networking is slow and fix root causes rather than continuing to widen timeouts.

### Phase 1: Diagnose

- [ ] Profile the onboard network path end-to-end on native Linux, macOS, and WSL2: where is time actually spent?
- [ ] Measure DNS resolution latency inside the k3s container vs. on the host — is CoreDNS adding overhead?
- [ ] Measure TLS handshake time for inference provider endpoints from inside the sandbox vs. from the host
- [ ] Determine whether the L7 proxy (gateway) adds measurable latency to inference validation probes
- [ ] Check if `host.openshell.internal` resolution is slow on WSL2 (goes through Windows DNS?)
- [ ] Profile the `sleep()` calls — which are covering real async settling vs. papering over race conditions?

### Phase 2: Optimize

Based on diagnosis, potential fixes (non-exhaustive):

- **DNS caching/preflight**: Pre-resolve and cache provider DNS during onboard before validation probes hit the gateway
- **Connection reuse**: Validation probes currently spawn a new `curl` per attempt — a persistent connection (or at least keepalive) would skip repeated TCP+TLS handshakes
- **Parallel health checks**: Some sequential polling loops could overlap (e.g., gateway health + sandbox ready + dashboard ready)
- **Reduce gateway round-trips**: Validation probes go host → gateway → provider. If the gateway adds overhead, consider a direct probe option for validation only
- **Replace sleeps with event-driven waits**: Many `sleep(2)` calls are waiting for a process or pod state — replace with `kubectl wait`, readiness probes, or file watches where possible
- **Platform-aware defaults**: Instead of doubling timeouts for WSL2 as a special case, consider adaptive timeouts that measure the first probe latency and scale subsequent timeouts accordingly

### Phase 3: Harden

- [ ] Add onboard timing telemetry (opt-in) so we can see real-world latency distributions
- [ ] Set a performance budget: onboard on a warm system with cached images should complete in < N seconds
- [ ] Add CI timing regression tests that fail if onboard wall-clock exceeds the budget

## Current timeout inventory

| Location | Current value | Why |
|----------|--------------|-----|
| Validation probe (`onboard.ts`) | 10s connect / 15s total (20/30 on WSL2) | Inference provider reachability |
| HTTP probe default (`http-probe.ts`) | 10s connect / 60s total | Streaming inference responses |
| Local provider health (`local-inference.ts`) | 3s connect / 5s total | Ollama/vLLM liveness |
| Gateway liveness (`nemoclaw.ts`) | 3s max | Quick gateway up/down check |
| ARM64 health poll | 30 × 10s = 300s | k3s slow init on ARM |
| WSL2 sandbox ready | 30 × 2s = 60s | Pod init under Docker Desktop |
| Ollama model pull | 600s (10 min) | Large model download |
| OpenShell install | 300s (5 min) | Binary download + extract |

## References

- #1998 — WSL2 timeout widening (trigger for this investigation)
- #987 — Original WSL2 inference verification timeout issue

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: investigate and reduce networking latency during onboard and validation #2001

Problem

Scope

Phase 1: Diagnose

Phase 2: Optimize

Phase 3: Harden

Current timeout inventory

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Location	Current value	Why
Validation probe (`onboard.ts`)	10s connect / 15s total (20/30 on WSL2)	Inference provider reachability
HTTP probe default (`http-probe.ts`)	10s connect / 60s total	Streaming inference responses
Local provider health (`local-inference.ts`)	3s connect / 5s total	Ollama/vLLM liveness
Gateway liveness (`nemoclaw.ts`)	3s max	Quick gateway up/down check
ARM64 health poll	30 × 10s = 300s	k3s slow init on ARM
WSL2 sandbox ready	30 × 2s = 60s	Pod init under Docker Desktop
Ollama model pull	600s (10 min)	Large model download
OpenShell install	300s (5 min)	Binary download + extract

perf: investigate and reduce networking latency during onboard and validation #2001

Description

Problem

Scope

Phase 1: Diagnose

Phase 2: Optimize

Phase 3: Harden

Current timeout inventory

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions