Branch: investigate/cloudflare-workers-compat
Date: 2026-05-12
Question: Can the Pilot Protocol SDK run inside a Cloudflare Worker with nodejs_compat?
The current Node SDK cannot run on Workers, period. Two non-negotiable blockers:
- The SDK loads `libpilot.dylib/so/dll` via `koffi` (FFI). Workers do not support native FFI / N-API. There is no path to load a Go-compiled shared library.
- The pilot protocol uses UDP. Workers' `node:dgram` is documented as "partially supported (non-functional)": it is present as an importable stub, but does nothing. There is no raw UDP transport available in the Workers runtime.
A new "Workers-compatible" mode is buildable, but it is a new architecture, not a port of the existing SDK. See Architecture options below.
| Workers / nodejs_compat capability | Pilot needs it for | Status |
|---|---|---|
| `node:dgram` (UDP) | the entire wire protocol | ❌ non-functional stub |
| `node:net` (raw TCP) | could replace UDP via a tunnel | ✅ supported |
| `node:fs` | identity.json, trust.json, config | ✅ supported (ephemeral; pair with KV/D1/R2 for persistence) |
| Native FFI / N-API | koffi → libpilot.so | ❌ not supported |
| `child_process` | spawning the daemon | ❌ non-functional stub |
| WebSocket client | tunnelled relay | ✅ supported via fetch upgrade |
| `crypto` (Web Crypto + `node:crypto` subset) | X25519, Ed25519, AES-256-GCM | ✅ supported, but X25519 is via SubtleCrypto only |
| Outbound `connect()` TCP | tunnel transport | ✅ supported (port 25 blocked, Cloudflare IPs blocked, no localhost / private IPs) |
Source: https://developers.cloudflare.com/workers/runtime-apis/nodejs/, https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets/.
The Node SDK is structured as:
sdk/node/src/
├── ffi.ts       ← koffi loads libpilot.{so,dylib,dll}
├── client.ts    ← Driver wrapper, marshals JSON over FFI
├── runtime.ts   ← library discovery + lifecycle
└── cli.ts       ← pilotctl shim
libpilot is built from sdk/cgo/bindings.go — it embeds the full Go daemon's driver (the unix-socket client side that talks to a running pilot-daemon over the IPC socket).
In a Worker, every link of that chain fails:
- `koffi.load('libpilot.dylib')` → no FFI runtime
- Even if the library loaded, it would `unix.Dial('/tmp/pilot.sock')` → no Unix sockets and no daemon process
- Even if the daemon ran, it binds a UDP socket → no UDP
So the SDK is not partially incompatible — it is foundationally incompatible.
Option A (HTTP/WebSocket bridge). Run a regular daemon on conventional infra (VM, k8s pod, Fly.io, etc.) with full UDP access. Expose a TCP-reachable bridge: HTTP for request/reply operations (/data, info, lookup) plus WebSocket for streaming (recvFrom, events, tunneled conns).
┌────────────────┐    HTTPS/WSS     ┌──────────────────┐    UDP    ┌────────┐
│  Cloudflare    │ ───────────────► │  pilot-bridge    │ ────────► │  peer  │
│  Worker        │ ◄─────────────── │  daemon (+ HTTP) │ ◄──────── │ daemon │
└────────────────┘                  └──────────────────┘           └────────┘
Pros:
- Worker becomes a thin client; nothing about its identity is special-cased on the wire (it inherits the bridge's identity, or has its own which the bridge proxies for).
- Shippable in days, not months: the bridge is just an HTTP wrapper around `driver.Driver`.
- TLS termination at the bridge; auth via a per-Worker bearer token.
Cons:
- The Worker isn't really a Pilot peer — it's piggybacking on a daemon. From the network's perspective, it's the bridge that has identity/trust.
- Extra hop = extra latency (50–200 ms depending on geographic placement).
- Bridge becomes a fan-in chokepoint — needs to scale horizontally with multiple bridges per region.
Effort estimate: ~1–2 weeks. New cmd/pilot-bridge/ binary, sdk/workers/ package, auth model, deployment story.
Option B (pure-TS port). Build a pure-TypeScript implementation of the Pilot wire protocol (handshake/PILA, AEAD envelope, replay window, key-exchange state). The Worker becomes a real Pilot peer with its own identity. UDP packets are tunneled over a WebSocket connection to a "WS-UDP relay" service.
┌────────────────┐    WSS (UDP-in-frames)    ┌──────────────────┐    UDP    ┌────────┐
│  Cloudflare    │ ─────────────────────────►│  ws-udp-relay    │ ────────► │  peer  │
│  Worker        │ ◄─────────────────────────│  (stateless)     │ ◄──────── │ daemon │
│  (Pilot peer)  │                           └──────────────────┘           └────────┘
└────────────────┘
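As a rough illustration of what "UDP-in-frames" could mean, the sketch below packs one datagram per binary WebSocket message, prefixed with the peer's address so that a stateless relay knows where to forward it. The 6-byte header layout and the names `Datagram`, `encodeFrame`, and `decodeFrame` are assumptions made for this example; the memo does not define a relay frame format, and a real one would also need IPv6 and versioning.

```typescript
// Hypothetical frame layout (an assumption, not a defined Pilot format):
//   bytes 0..3  IPv4 address of the remote peer
//   bytes 4..5  UDP port, big-endian
//   bytes 6..   the raw UDP payload, unmodified
interface Datagram {
  host: string; // dotted-quad IPv4, e.g. "203.0.113.7"
  port: number; // 0..65535
  payload: Uint8Array;
}

const HEADER_LEN = 6;

// Wrap one datagram into one binary WebSocket message.
function encodeFrame(d: Datagram): Uint8Array {
  const parts = d.host.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`not an IPv4 address: ${d.host}`);
  }
  if (!Number.isInteger(d.port) || d.port < 0 || d.port > 0xffff) {
    throw new Error(`port out of range: ${d.port}`);
  }
  const frame = new Uint8Array(HEADER_LEN + d.payload.length);
  frame.set(parts.map(Number), 0);
  new DataView(frame.buffer).setUint16(4, d.port, false); // big-endian
  frame.set(d.payload, HEADER_LEN);
  return frame;
}

// Unwrap a binary WebSocket message back into address + payload.
function decodeFrame(frame: Uint8Array): Datagram {
  if (frame.length < HEADER_LEN) throw new Error("frame shorter than header");
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return {
    host: Array.from(frame.subarray(0, 4)).join("."),
    port: view.getUint16(4, false),
    payload: frame.slice(HEADER_LEN), // copy, so the caller may reuse `frame`
  };
}
```

Keeping the address in every frame is what would let the relay stay stateless: it needs no per-connection routing table, only the header of the message in hand.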
Pros:
- Worker is a real peer with its own keys, address, and trust state.
- Relay is stateless and trivially horizontally scalable.
- Long-term answer: the relay can later be replaced by direct UDP if Workers ever gains it.
Cons:
- Substantial new TS implementation — the Go protocol code is ~10k LOC, the subset to port is ~2–3k LOC (envelope + handshake + replay-window + key-exchange + ECDH/AEAD via WebCrypto).
- Have to keep the TS implementation in sync with future Go changes (the existing FFI approach has the Go code as single source of truth).
- WebSocket → UDP mapping has its own edge cases (frame ordering, MTU, reconnects, NAT-keepalive equivalent).
- Worker CPU budget (typical 50–100 ms per request, or up to 5 min on "unbound") needs careful pacing for handshakes + retransmits.
Effort estimate: ~6–10 weeks. New sdk/workers-ts/ with pure-TS protocol, new cmd/ws-udp-relay/ for the tunneling endpoint, conformance test suite to keep parity with Go.
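To give a feel for the kind of "bit math" a pure-TS port would have to reproduce exactly, here is a minimal sliding-window replay check in the style commonly used by datagram protocols. It is a generic sketch, not the Pilot implementation: the 32-entry window, the plain-number counter, and the `ReplayWindow` name are all assumptions, and the Go code's actual window size and drop rules would need to be matched by the conformance suite.

```typescript
// Generic anti-replay window (an illustration, not Pilot's actual algorithm).
// Assumes packet counters are non-negative integers that fit in a JS number.
class ReplayWindow {
  static readonly SIZE = 32; // hypothetical window width
  private highest = -1;      // highest counter accepted so far
  private bitmap = 0;        // bit i set => counter (highest - i) was seen

  /** Returns true and records the counter if it is fresh; false if replayed or too old. */
  accept(counter: number): boolean {
    if (counter > this.highest) {
      // New high-water mark: slide the window forward.
      const shift = counter - this.highest;
      this.bitmap = shift >= ReplayWindow.SIZE ? 0 : (this.bitmap << shift) >>> 0;
      this.bitmap = (this.bitmap | 1) >>> 0;
      this.highest = counter;
      return true;
    }
    const offset = this.highest - counter;
    if (offset >= ReplayWindow.SIZE) return false; // fell off the back of the window
    const bit = (1 << offset) >>> 0;
    if ((this.bitmap & bit) !== 0) return false;   // already seen
    this.bitmap = (this.bitmap | bit) >>> 0;
    return true;
  }
}
```

The `>>> 0` coercions matter because JavaScript bitwise operators work on signed 32-bit integers; this is exactly the sort of detail where a TS port can silently diverge from the Go code.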
Option C (managed REST API). The most pragmatic option. Expose a REST API on a managed Pilot endpoint (think: pilot.cloud/v1/send etc.). Workers hit the API like any other HTTP service. Auth via API key.
Pros:
- Workers don't need to know Pilot exists. Any client of `fetch()` works.
- Stateless from the Worker's side.
Cons:
- The user loses the "agent is a network citizen" property — every Worker call is just an HTTP request to a centralized endpoint.
- Conceptually closer to "API gateway" than "Pilot peer" — semantically different product.
Effort estimate: ~3–5 days. The bridge from Option A is most of the work; the API surface is small.
Option D (Go→WASM hybrid). The middle path. Compile the pilot protocol's pure logic (crypto, replay window, envelope packing/unpacking, key-exchange state machine, replay-recovery gates) to WebAssembly. Keep the transport (UDP→WS, fetch, TLS) in TypeScript using Workers' `connect()`. The WASM module exports a small surface like:
// In the WASM module (Go):
pilot_handshake_build(...) → bytes
pilot_handshake_handle(state, bytes) → bytes
pilot_envelope_encrypt(state, plain) → bytes
pilot_envelope_decrypt(state, encrypted) → plain | error_code
The TypeScript side owns sockets, timers, retx loops, and feeds bytes into the WASM.
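A minimal sketch of that division of labour follows, assuming the WASM exports are wrapped in a TypeScript object. The `PilotCore` interface, its method names and return shapes, and the `StepDriver` class are hypothetical: the real export signatures are not settled, and the explicit `state` argument is hidden inside the wrapper here. The point is only that the core never blocks or spawns work; TS calls it once per timer tick or inbound packet.

```typescript
// Hypothetical TS view of the WASM exports; real signatures may differ.
interface PilotCore {
  handshakeBuild(): Uint8Array;
  handshakeHandle(incoming: Uint8Array): { reply?: Uint8Array; done: boolean };
  envelopeEncrypt(plain: Uint8Array): Uint8Array;
  envelopeDecrypt(sealed: Uint8Array): Uint8Array | number; // plaintext or error code
}

type Send = (bytes: Uint8Array) => void;

class StepDriver {
  private core: PilotCore;
  private send: Send;
  private maxAttempts: number;
  private attempts = 0;
  private established = false;

  constructor(core: PilotCore, send: Send, maxAttempts = 5) {
    this.core = core;
    this.send = send;
    this.maxAttempts = maxAttempts;
  }

  /** Called from a TS-owned timer: (re)sends the handshake until it completes. */
  tick(): void {
    if (this.established) return;
    if (this.attempts >= this.maxAttempts) throw new Error("handshake timed out");
    this.attempts++;
    this.send(this.core.handshakeBuild());
  }

  /** Called by the transport for each inbound packet; returns plaintext if any. */
  onPacket(bytes: Uint8Array): Uint8Array | null {
    if (!this.established) {
      const step = this.core.handshakeHandle(bytes);
      if (step.reply) this.send(step.reply);
      if (step.done) this.established = true;
      return null;
    }
    const out = this.core.envelopeDecrypt(bytes);
    return typeof out === "number" ? null : out; // drop on error code
  }

  /** Encrypts and sends application data once the handshake is done. */
  sendData(plain: Uint8Array): void {
    if (!this.established) throw new Error("handshake not complete");
    this.send(this.core.envelopeEncrypt(plain));
  }
}
```

In a Worker, `tick()` would be driven by something like `setTimeout`, and `send` would write to whichever transport the TS side owns.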
Feasibility check (verified 2026-05-12):
| Constraint | Limit | Pilot fit |
|---|---|---|
| Worker bundle, gzipped | 3 MB free / 10 MB paid | A standard Go WASM is ~10 MB raw, ~3 MB gzipped — right at the free-tier ceiling, fits paid |
| Worker bundle, uncompressed | 64 MB | trivially fits |
| WASI support | "experimental, only some syscalls" | irrelevant — the host-import bridge is used, not WASI |
| Go→WASM target | `GOOS=js GOARCH=wasm` (browser-shim) or `GOOS=wasip1 GOARCH=wasm` | use `wasm_exec.js`-style shim with custom host imports |
| TinyGo size | ~200–500 KB for the relevant subset | better fit, but TinyGo's incomplete support for goroutines / channels / reflect would force a refactor away from those constructs in the compiled pilot packages |
What CAN be compiled to WASM:
- `internal/crypto` (Ed25519 sign/verify, X25519 derive)
- `pkg/daemon/envelope` (AEAD wrap/unwrap, replay window check, ReplayCount + ShouldDropOnReplay)
- `pkg/daemon/keyexchange/crypto.go` (key material structs, Salvage, threshold gates)
- `pkg/daemon/keyexchange/handle.go` (PILA build / parse: a pure function over bytes)
- `pkg/protocol/*` (frame encode/decode, addr parse, checksum)
What CANNOT be compiled (or only with refactoring):
- `pkg/daemon/tunnel.go` (uses `net.UDPConn`, time.Ticker loops, goroutines for routing; needs to live in TS)
- `pkg/daemon/routing` (similar)
- `pkg/daemon/udpio` (literal UDP)
- Anything using `os.File` or external processes
- The retransmit loop (a TS `setTimeout`/`Promise` loop calling into the WASM each tick)
Pros:
- Go source remains single source of truth for protocol correctness (handshake bytes, AEAD framing, replay-window bit math).
- The fiddly bits — exact byte layouts, AAD construction, nonce composition — are tested by the existing Go test suite and ship via WASM, not retyped in TS.
- Vastly smaller TS surface than Option B (just the I/O loops + driver-style API).
- Future Go-side protocol changes auto-flow into WASM with a `make wasm` rebuild.
Cons:
- Two languages for one library — debugging spans both.
- Goroutine→TS event-loop bridging is non-trivial. The WASM module must be designed so all its "callbacks" surface as deterministic return values rather than spawning goroutines.
- Standard Go's WASM runtime is ~3 MB gzipped, right at the free-tier ceiling. TinyGo would solve the size problem, but its goroutine support is incomplete, and the asymmetric-recovery code relies on goroutines for background retransmits. Those paths would need to be refactored into "step functions" that the TS driver calls each tick.
- Need a custom `wasm_exec.js` shim that imports Workers' `connect()` / `crypto.subtle` / `setTimeout` into the wasm host. Cloudflare's WASI is experimental and lacks sockets, so it cannot be relied on.
Effort estimate: ~4–6 weeks. Faster than Option B (no protocol re-implementation), slower than Option A (still need the wasm host shim + a redesigned step-function-friendly protocol surface in the Go code).
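Cloudflare Workers can import WASM modules as binary assets. In principle the .wasm file could live outside the Worker bundle and be loaded via a service binding or Workers AI / R2, which would sidestep the bundle size limit at the cost of a fetch on cold start. This needs verifying before relying on it, because Workers restricts compiling WASM from bytes obtained at runtime.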
If the goal is "agents on Workers can talk to Pilot peers", ship Option A first (HTTP/WebSocket bridge) as a 1–2-week deliverable. The Worker stays a thin client; the bridge holds the keys; the rest of the Pilot fleet sees a normal peer.
If the goal is "Cloudflare Workers become first-class Pilot peers with their own identity" (the more interesting product), Option D (Go→WASM hybrid) is the right long-term answer — single source of truth for protocol correctness stays in Go, but the Worker really is a peer. Plan ~4–6 weeks. If TinyGo's stdlib constraints prove too painful for the recovery loops (goroutines), fall back to Option B (pure-TS port) for ~6–10 weeks at the cost of maintaining a second protocol implementation.
Option C is only attractive if the audience is "developers who want Pilot data but don't care about being agents" — which is a different product.
- Sketch the bridge wire format — likely JSON-RPC 2.0 over HTTPS for unary calls + WebSocket for bidi streams (recv, events).
- Auth: per-bridge-tenant API key; bridge has one Pilot identity, proxies all tenants' calls under that identity.
- Write `sdk/workers/`: a TS module that targets the Workers `connect()` / `fetch()` API surface, with no `node:fs` / `node:dgram` / `koffi` references.
- Conformance smoke: a Worker on a real `*.workers.dev` that does `bridge.info()` and `bridge.sendMessage('list-agents', '/data {}')` against a real bridge deployment.
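For the unary half of the wire format, a JSON-RPC 2.0 call with a bearer token might be assembled as in the sketch below. The `/rpc` path, the `"info"` method name, and the helper names `buildCall` and `parseReply` are assumptions for illustration; only the JSON-RPC 2.0 envelope fields (`jsonrpc`, `id`, `method`, `params`, `result`, `error`) come from that specification.

```typescript
interface RpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown;
}

interface RpcReply {
  jsonrpc?: string;
  id?: number | null;
  result?: unknown;
  error?: { code: number; message: string };
}

// Build the URL and fetch() options for one unary bridge call.
// "/rpc" is a hypothetical endpoint path.
function buildCall(bridgeUrl: string, token: string, id: number, method: string, params: unknown) {
  const body: RpcRequest = { jsonrpc: "2.0", id, method, params };
  return {
    url: bridgeUrl.replace(/\/+$/, "") + "/rpc",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// Validate a response body and return its result, or throw on any mismatch.
function parseReply(text: string, expectedId: number): unknown {
  const msg = JSON.parse(text) as RpcReply;
  if (msg.jsonrpc !== "2.0") throw new Error("not a JSON-RPC 2.0 response");
  if (msg.error) throw new Error(`bridge error ${msg.error.code}: ${msg.error.message}`);
  if (msg.id !== expectedId) throw new Error("response id does not match request");
  return msg.result;
}

// Inside a Worker handler this could be used roughly as:
//   const call = buildCall(env.BRIDGE_URL, env.BRIDGE_TOKEN, 1, "info", {});
//   const res = await fetch(call.url, call.init);
//   const info = parseReply(await res.text(), 1);
// (env.BRIDGE_URL and env.BRIDGE_TOKEN are hypothetical binding names.)
```

Splitting request building from response parsing keeps both halves testable without a network, which suits the conformance smoke test described above.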
- Identity model: does each Worker tenant get its own Pilot node_id (bridge proxies trust handshakes per tenant), or share the bridge's? Affects ~everything downstream.
- Egress cost: high-traffic Workers calling the bridge would generate non-trivial egress between the bridge VM and Cloudflare's network. Likely acceptable; worth modelling.
- WebCrypto X25519 availability: confirmed supported per the Cloudflare docs, but worth a `crypto.subtle.deriveKey(...{name:'X25519'})` spike before committing to Option B.
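The X25519 spike could be as small as the sketch below: generate two key pairs, derive the shared secret in both directions, and check that the results match. It uses `deriveBits` rather than `deriveKey` so the raw 32 bytes can be compared directly, and the function name `x25519Agrees` is made up for this example. It assumes a runtime whose WebCrypto accepts the `"X25519"` algorithm name, which is the very thing the spike is meant to confirm on Workers.

```typescript
// Spike: do two fresh X25519 key pairs agree on a shared secret via WebCrypto?
// Rejects (throws) if the runtime does not recognise "X25519".
async function x25519Agrees(): Promise<boolean> {
  const subtle = globalThis.crypto.subtle;

  const generate = () =>
    subtle.generateKey({ name: "X25519" }, false, ["deriveBits"]) as Promise<CryptoKeyPair>;
  const [alice, bob] = await Promise.all([generate(), generate()]);

  // Each side combines its own private key with the other side's public key.
  const [ab, ba] = await Promise.all([
    subtle.deriveBits({ name: "X25519", public: bob.publicKey }, alice.privateKey, 256),
    subtle.deriveBits({ name: "X25519", public: alice.publicKey }, bob.privateKey, 256),
  ]);

  const x = new Uint8Array(ab);
  const y = new Uint8Array(ba);
  return x.length === 32 && y.length === 32 && x.every((v, i) => v === y[i]);
}
```

A rejected promise here would be the early warning that the ECDH step of a pure-TS port needs a different primitive or a polyfill.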