Problem
Multiple PRs have independently worked around slow networking by bumping timeouts, adding retries, and inserting sleep delays. Each fix addresses a symptom, but the cumulative effect is a fragile onboard experience that breaks on slower hardware and adds minutes of wall-clock time on every platform.
Examples:
- sleep() calls scattered across onboard, recovery, and validation paths
As @brandonpelfrey noted in #1998: "Across multiple PRs from different folks I'm seeing things like add 10 seconds here and there so $thing doesn't time out. Want to make sure we're questioning why networking seems to be so fiddly." These timeout bumps are fragile — they work on the machines that were tested but will break on slower hardware.
Scope
This is a diagnostic and optimization effort, not a single bug fix. The goal is to understand why networking is slow and fix root causes rather than continuing to widen timeouts.
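Phase 1: Diagnose
- Whether host.openshell.internal resolution is slow on WSL2 (goes through Windows DNS?)
- sleep() calls: which are covering real async settling vs. papering over race conditions?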
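One way to start the diagnosis is to record how long each onboard step takes, so that slow phases show up as numbers rather than as timeouts. The sketch below is only an illustration: `timePhase` and `PhaseTiming` are hypothetical names, not existing project code.

```typescript
// Hypothetical record of one timed step (not an existing project type).
interface PhaseTiming {
  label: string;
  ms: number;
  ok: boolean;
}

// Runs one async step, appends its duration to `log`, and returns the
// step's result. A failing step is logged with ok=false and yields undefined,
// so one slow or broken phase does not hide the timings of the others.
async function timePhase<T>(
  label: string,
  step: () => Promise<T>,
  log: PhaseTiming[]
): Promise<T | undefined> {
  const start = Date.now();
  try {
    const result = await step();
    log.push({ label, ms: Date.now() - start, ok: true });
    return result;
  } catch {
    log.push({ label, ms: Date.now() - start, ok: false });
    return undefined;
  }
}
```

As a usage example, wrapping a DNS lookup (for instance `lookup("host.openshell.internal")` from Node's `node:dns/promises`) and the validation probe in separate `timePhase` calls would show whether name resolution or the request itself dominates.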
Phase 2: Optimize
Based on diagnosis, potential fixes (non-exhaustive):
- DNS caching/preflight: Pre-resolve and cache provider DNS during onboard before validation probes hit the gateway
- Connection reuse: Validation probes currently spawn a new curl per attempt. A persistent connection (or at least keepalive) would skip repeated TCP+TLS handshakes
- Parallel health checks: Some sequential polling loops could overlap (e.g., gateway health + sandbox ready + dashboard ready)
- Reduce gateway round-trips: Validation probes go host → gateway → provider. If the gateway adds overhead, consider a direct probe option for validation only
- Replace sleeps with event-driven waits: Many sleep(2) calls are waiting for a process or pod state. Replace them with kubectl wait, readiness probes, or file watches where possible
- Platform-aware defaults: Instead of doubling timeouts for WSL2 as a special case, consider adaptive timeouts that measure the first probe latency and scale subsequent timeouts accordingly
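As a rough illustration of the parallel health check idea, the sketch below runs several independent readiness checks concurrently, each bounded by its own timeout, so the total wait is close to the slowest check rather than the sum of all of them. The names `withTimeout`, `runChecksInParallel`, and `NamedCheck` are hypothetical, and the checks themselves are placeholders for whatever the gateway, sandbox, and dashboard probes actually do.

```typescript
// Hypothetical shapes for this sketch; not existing project types.
interface NamedCheck {
  name: string;
  check: () => Promise<boolean>;
}

interface CheckResult {
  name: string;
  ready: boolean;
  ms: number;
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Starts every check at once and waits for all of them. A check that
// throws or times out is reported as not ready instead of failing the batch.
async function runChecksInParallel(
  checks: NamedCheck[],
  timeoutMs: number
): Promise<CheckResult[]> {
  const started = Date.now();
  return Promise.all(
    checks.map(async ({ name, check }) => {
      try {
        const ready = await withTimeout(check(), timeoutMs);
        return { name, ready, ms: Date.now() - started };
      } catch {
        return { name, ready: false, ms: Date.now() - started };
      }
    })
  );
}
```

This only helps for checks that are truly independent. If one check can only pass after another has passed, running them in parallel saves nothing and the sequential order should stay.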
Phase 3: Harden
Current timeout inventory
| Location | Current value | Why |
|---|---|---|
| Validation probe (onboard.ts) | 10s connect / 15s total (20/30 on WSL2) | Inference provider reachability |
| HTTP probe default (http-probe.ts) | 10s connect / 60s total | Streaming inference responses |
| Local provider health (local-inference.ts) | 3s connect / 5s total | Ollama/vLLM liveness |
| Gateway liveness (nemoclaw.ts) | 3s max | Quick gateway up/down check |
| ARM64 health poll | 30 × 10s = 300s | k3s slow init on ARM |
| WSL2 sandbox ready | 30 × 2s = 60s | Pod init under Docker Desktop |
| Ollama model pull | 600s (10 min) | Large model download |
| OpenShell install | 300s (5 min) | Binary download + extract |
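To make the adaptive-timeout idea concrete against this inventory, here is a minimal sketch that treats one row as a baseline and scales it by the latency of the first probe. The function name `adaptiveTimeouts`, the 500 ms expected latency, and the 4x cap are all assumptions chosen for illustration, not measured or agreed values.

```typescript
// A connect/total timeout pair in milliseconds, as in the inventory rows.
interface TimeoutPair {
  connectMs: number;
  totalMs: number;
}

// Scales a baseline timeout pair by how slow the first probe was relative
// to an expected latency. The scale never drops below 1 (so a fast machine
// keeps the baseline) and never exceeds maxScale (so a stalled probe cannot
// inflate timeouts without bound).
// expectedMs = 500 and maxScale = 4 are illustrative assumptions.
function adaptiveTimeouts(
  baseline: TimeoutPair,
  firstProbeMs: number,
  expectedMs: number = 500,
  maxScale: number = 4
): TimeoutPair {
  const raw = firstProbeMs / expectedMs;
  // Fall back to the baseline if the measurement is unusable (NaN, Infinity).
  const scale = isFinite(raw) ? Math.min(maxScale, Math.max(1, raw)) : 1;
  return {
    connectMs: Math.round(baseline.connectMs * scale),
    totalMs: Math.round(baseline.totalMs * scale),
  };
}

// Baseline from the validation probe row: 10s connect / 15s total.
const validationBaseline: TimeoutPair = { connectMs: 10000, totalMs: 15000 };
```

With these assumed parameters, a first probe of 1000 ms doubles the validation baseline to 20s connect / 30s total, which matches the current hard-coded WSL2 values without needing a platform check.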