Status: Shipped to production 2026-05-18. Live at wss://beacon.pilotprotocol.network/v1/compat.
Scope: new transport for pilot-daemon that tunnels Pilot packets over WebSocket Secure to the beacon, so daemons in UDP-blocked environments (Docker on Render/Railway/Vercel/Fly/Lambda, restrictive corp networks) can still join the overlay.
Issue: addresses the "HTTP gateway" ask in Garry Tan's 2026-05-16 bug report.
The draft below leaned toward Caddy + an embedded Pilot CA root from day one. Reality on pilot-rendezvous-new made the simpler path strictly better, so a few decisions changed during the rollout:
- TLS terminator: nginx, not Caddy. The production host already runs nginx 1.22 with certbot for `console.pilotprotocol.network` and `polo.pilotprotocol.network`. Adding a third server block for `beacon.pilotprotocol.network` reused the existing TLS automation. Caddy is no longer planned.
- Cert: Let's Encrypt, not the Pilot root CA, for now. The Pilot CA tooling (`cmd/pilot-ca`, `internal/transport/compat/roots/`) ships in the binary but the embedded root is the dev root (`dev-2026.pem`). Production currently uses a Let's Encrypt leaf. A future release will mint the prod root, embed it in client binaries, and flip the daemon's `-tls-trust` default back to `pinned`.
- Daemon `-tls-trust` default: `system`, not `pinned`. Because the production beacon uses Let's Encrypt today, `pinned` (which would only trust the embedded Pilot root) would refuse every connection. Default is `system` until the production root ships.
- Beacon binary: `cmd/rendezvous`, not `cmd/beacon`. Production runs the combined `pilot-rendezvous` (registry + beacon in one process). The WSS bridge is wired via a new `-wss-addr` flag plumbed to `pkg/beacon.Server.EnableCompatWSS`. The pubkey resolver reads from the in-process registry's `LookupPublicKey`.
- Rollout phases 1-6 (and the originally-deferred relay-worker integration) all shipped together. See the "Rollout — what shipped" section at the bottom.
The draft text below is preserved for design context. Where reality diverges, the section above is authoritative.
Today every Pilot daemon must bind a public UDP socket (directly or via beacon hole-punch). On modern container PaaS (Render, Railway, Vercel, Lambda) UDP is impractical: either the platform doesn't expose UDP ports at all, or the symmetric NAT defeats hole-punching. Garry Tan's bug report explicitly calls this out — the catalogue is the killer feature, the UDP transport is the barrier.
Compat mode is a second transport for the daemon: instead of binding a UDP socket, the daemon opens a long-lived WebSocket Secure connection to the beacon on TCP port 443. Each WS binary frame carries exactly one Pilot packet, both directions. The beacon's existing relay machinery shuttles those packets to/from UDP peers.
End-to-end Ed25519 trust is unchanged. TLS provides the encrypted channel and server-auth (the daemon knows it's talking to a real beacon); Ed25519 provides peer-auth (the specialist knows the caller is who they say they are). Compat mode is a wire change, not a trust-model change.
- A daemon in a UDP-blocked Docker container can join the Pilot overlay and roundtrip queries against any specialist.
- UDP daemons (today's specialists, today's clients) need zero code changes to talk to compat-mode peers.
- All four cells of the transport matrix work:
  - UDP ↔ UDP: unchanged
  - UDP ↔ WSS: beacon translates
  - WSS ↔ UDP: beacon translates
  - WSS ↔ WSS: beacon shuttles between two WS conns
- Compat mode is opt-in (CLI flag) for the first release; later auto-fallback after N seconds of UDP failure.
- TLS pinning to a Pilot-controlled root CA. Standard PKI compromise of public CAs cannot MITM compat daemons.
- Works through ~all commercial firewalls. Documented escape hatch for TLS-intercepting corp proxies.
Non-goals:
- Direct browser/curl access to specialists. (That's a separate centralized "HTTPS gateway" service, out of scope here; it can be built on top of compat mode later.)
- Domain-fronting / ECH / GFW-bypass. State-level censorship resistance is a separate, much larger project.
- Removing UDP. UDP remains the primary transport. Compat mode is an alternative for hosts that can't use it.
| Source | Destination | Path |
|---|---|---|
| UDP daemon | UDP daemon (direct) | unchanged — direct UDP, or beacon UDP-relay if NAT-bad |
| UDP daemon | WSS daemon (compat) | UDP packets travel to beacon (as UDP relay); beacon writes them out as WS frames on the compat peer's conn |
| WSS daemon (compat) | UDP daemon | compat daemon writes WS frames to beacon; beacon writes them out as UDP relay packets to the destination |
| WSS daemon | WSS daemon | both terminate WSS at the beacon; beacon shuttles WS frames from one conn to the other |
The beacon is the universal hub. UDP peers don't know whether a remote is on UDP or WSS — they see relay_only=true in the registry and route via beacon as they do today for symmetric-NAT peers.
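The table above can be read as a small decision function. The following is a minimal sketch only: the `Transport` type, the `relayPath` name, and the `directOK` parameter are hypothetical and do not come from the Pilot codebase.

```go
package main

import "fmt"

// Transport is a hypothetical label for how a daemon reaches the beacon.
type Transport string

const (
	UDP Transport = "udp"
	WSS Transport = "wss"
)

// relayPath names the path a packet takes from src to dst, following the
// table above. directOK only matters for UDP-to-UDP: false means NAT
// traversal failed and the beacon's UDP relay is used instead.
func relayPath(src, dst Transport, directOK bool) string {
	switch {
	case src == UDP && dst == UDP && directOK:
		return "direct UDP"
	case src == UDP && dst == UDP:
		return "beacon UDP relay"
	case src == UDP && dst == WSS:
		return "UDP in, WS frame out"
	case src == WSS && dst == UDP:
		return "WS frame in, UDP out"
	case src == WSS && dst == WSS:
		return "WS frame in, WS frame out"
	default:
		return "unknown"
	}
}

func main() {
	for _, src := range []Transport{UDP, WSS} {
		for _, dst := range []Transport{UDP, WSS} {
			fmt.Printf("%s -> %s: %s\n", src, dst, relayPath(src, dst, true))
		}
	}
}
```

Every cell except direct UDP passes through the beacon, which is why UDP peers need no knowledge of the remote's transport.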
- A dedicated root CA keypair is minted offline (e.g. on a Yubikey).
- The root CA's PEM-encoded certificate is embedded in `cmd/daemon` via `//go:embed` so every daemon binary ships with the trust anchor pre-pinned.
- The CA signs leaf certs for each beacon hostname (`beacon-us.pilotprotocol.network`, `beacon-eu.…`, etc.).
- Leaf certs rotate via standard `tls.Config.GetCertificate`. Root rotation is a multi-release event handled by shipping the new root in a daemon update alongside the old one (overlap window).
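To make the pinning idea concrete, here is a minimal sketch using only the Go standard library. It mints a throwaway root in memory, signs a leaf for one beacon hostname, and verifies the leaf against a pool that contains only that root. The function names, key type (Ed25519), and validity windows are assumptions for illustration; this is not the `cmd/pilot-ca` implementation, and a real root key would stay offline.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// mintRoot creates a self-signed CA certificate. In the design above the
// root key is generated offline; here it lives in memory for illustration.
func mintRoot() (*x509.Certificate, ed25519.PrivateKey, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example root (hypothetical)"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, priv, err
}

// issueLeaf signs a server certificate for one beacon hostname.
func issueLeaf(root *x509.Certificate, rootKey ed25519.PrivateKey, host string) (*x509.Certificate, error) {
	pub, _, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(12 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, root, pub, rootKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyPinned accepts the leaf only if it chains to the single pinned root
// and is valid for host. No system CAs are consulted.
func verifyPinned(leaf, pinnedRoot *x509.Certificate, host string) error {
	pool := x509.NewCertPool()
	pool.AddCert(pinnedRoot)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool, DNSName: host})
	return err
}

func main() {
	const host = "beacon-us.pilotprotocol.network"
	root, key, err := mintRoot()
	if err != nil {
		panic(err)
	}
	leaf, err := issueLeaf(root, key, host)
	if err != nil {
		panic(err)
	}
	fmt.Println("pinned root accepts leaf:", verifyPinned(leaf, root, host) == nil)
}
```

During a root rotation overlap window, the same pool would simply hold both the old and the new root.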
- Beacon WSS listener on port 443 (or 8443 behind a reverse proxy).
- Production setup (shipped 2026-05-18): nginx 1.22 on `pilot-rendezvous-new` terminates TLS on :443 for `beacon.pilotprotocol.network` via a Let's Encrypt cert (certbot, auto-renewing). It reverse-proxies WebSocket upgrades to `127.0.0.1:18443` where the rendezvous binary's WSS bridge listens. The original draft assumed Caddy + a private Pilot CA root; reusing the host's existing nginx+certbot stack was strictly simpler.
- In the original draft, the daemon-side TLS config sets `RootCAs` to a `CertPool` containing only the embedded Pilot root, so system CAs are not trusted. The shipped default is `system` until the production root is embedded.
- CLI flag `-tls-trust=pinned|system`. Default: `system` while production uses Let's Encrypt. Will flip back to `pinned` once the production Pilot CA root ships embedded in client binaries.
- `pinned` verifies against the Pilot root CA embedded via `//go:embed`. When the embedded root is the dev placeholder (as today), `pinned` rejects every public-CA-signed beacon cert, so users who explicitly pass `-tls-trust=pinned` against the public beacon today will fail to connect. Intentional, until the production root ships.
- `system` falls back to the OS trust store. Matches the public beacon's Let's Encrypt cert. Daemon logs a clear `WARN: TLS trust relaxed to OS store — TLS-intercepting proxies on the path can read/alter relay traffic; end-to-end Ed25519 still protects payload identity`.
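The two trust modes map naturally onto `crypto/tls`. A minimal sketch, assuming a hypothetical `trustConfig` helper and a `pinnedPEM` parameter standing in for the embedded root bytes (neither name is taken from the daemon's source):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// trustConfig maps a -tls-trust value to a client tls.Config.
// pinnedPEM stands in for the root embedded via //go:embed (hypothetical).
func trustConfig(mode string, pinnedPEM []byte) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	switch mode {
	case "system":
		// A nil RootCAs makes crypto/tls use the host's root CA set.
		return cfg, nil
	case "pinned":
		// Only certificates chaining to the embedded root are accepted.
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pinnedPEM) {
			return nil, errors.New("pinned root: no certificate found in PEM")
		}
		cfg.RootCAs = pool
		return cfg, nil
	default:
		return nil, fmt.Errorf("unknown -tls-trust value %q", mode)
	}
}

func main() {
	cfg, err := trustConfig("system", nil)
	fmt.Println("system mode uses OS roots:", cfg.RootCAs == nil, "err:", err)

	_, err = trustConfig("pinned", []byte("not a PEM block"))
	fmt.Println("pinned mode with bad PEM:", err)
}
```

Failing closed when the pinned PEM does not parse avoids silently falling back to the OS store.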
- After the WS upgrade succeeds, the beacon sends a challenge as the first server frame: `{"type":"auth_challenge","nonce":"<32 random bytes hex>"}`
- The daemon replies: `{"type":"auth_reply","node_id":<N>,"public_key":"<base64>","sig":"<base64 Ed25519 over 'compat_auth:'+node_id+':'+nonce>"}`
- The beacon verifies the signature against the registered pubkey for that nodeID (same lookup as `handleHeartbeat`). On success it stores the mapping `nodeID → *websocket.Conn` and responds with `{"type":"auth_ok"}`.
- On failure: 401 close, no retry without backoff.
- All subsequent frames are binary frames containing one raw Pilot packet each. Text frames are reserved for control messages (currently just auth + close + ping/pong).
- One binary WS frame == one raw Pilot packet (including the 34-byte header + payload).
- Maximum frame size: 64 KB (matches Pilot's MTU cap with margin).
- Per-frame overhead: 2-14 bytes WS framing + TLS record overhead. Negligible vs Pilot's 34-byte header.
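A receiver can enforce the framing rules above with a simple length check before dispatch. This sketch assumes "64 KB" means 65536 bytes and that the cap applies to the whole frame including the header; `checkFrame` and the error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	headerLen = 34        // Pilot header size, per the framing rules above
	maxFrame  = 64 * 1024 // assumed: 64 KB cap covers header + payload
)

var (
	errShortFrame = errors.New("frame shorter than Pilot header")
	errLargeFrame = errors.New("frame exceeds 64 KB cap")
)

// checkFrame validates the size of one binary WS frame, which must carry
// exactly one raw Pilot packet.
func checkFrame(frame []byte) error {
	switch {
	case len(frame) < headerLen:
		return errShortFrame
	case len(frame) > maxFrame:
		return errLargeFrame
	default:
		return nil
	}
}

func main() {
	for _, n := range []int{10, headerLen, maxFrame, maxFrame + 1} {
		fmt.Printf("%6d bytes: %v\n", n, checkFrame(make([]byte, n)))
	}
}
```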
- `auth_challenge`, `auth_reply`, `auth_ok`: see above.
- `bye`: graceful close (optional; client may also just close the WS).
- Future: `rate_limit_warning`, `tier_signal`, etc.
- Beacon sends a WS ping every 30 seconds to keep idle proxies from culling the connection.
- Daemon must respond with pong within 10 seconds or the beacon closes the conn.
Today pkg/daemon/udpio.Socket owns the UDP FD and exposes Send(frame []byte, dst *net.UDPAddr) error and Recv() (frame []byte, src *net.UDPAddr, err error). That contract is extracted into a daemonio.Transport interface:
```go
type Transport interface {
	Send(frame []byte, dst Endpoint) error
	Recv() (frame []byte, src Endpoint, err error)
	LocalAddr() Endpoint
	Close() error
}

// Endpoint is opaque to higher layers: the UDP impl returns *net.UDPAddr,
// the WSS impl returns a wssEndpoint that wraps the beacon's logical addr.
// Equality is by content (so route-table lookups still work).
type Endpoint interface {
	String() string  // for logs
	Network() string // "udp" | "wss"
}
```

Existing UDP code becomes `udpTransport`, which implements `Transport` with byte-identical behavior, so today's daemons are unaffected. The new `wssTransport` is a sibling.
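One detail worth illustrating is "equality is by content". Two `*net.UDPAddr` values with the same IP and port are different pointers, so comparing them as interface values yields false. A route table therefore needs a content-derived key. The sketch below is one possible approach, not the shipped one; `endpointKey`, `udpEP`, and the `wssEndpoint` shape are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// Endpoint mirrors the interface from the design text.
type Endpoint interface {
	String() string
	Network() string
}

// wssEndpoint is a hypothetical WSS-side Endpoint wrapping the beacon address.
type wssEndpoint struct{ addr string }

func (e wssEndpoint) String() string  { return e.addr }
func (e wssEndpoint) Network() string { return "wss" }

// udpEP builds a UDP Endpoint; *net.UDPAddr already has String and Network.
func udpEP(ip string, port int) Endpoint {
	return &net.UDPAddr{IP: net.ParseIP(ip), Port: port}
}

// endpointKey derives a content-based map key, so two endpoints with the
// same network and address collide even if they are distinct pointers.
func endpointKey(e Endpoint) string {
	return e.Network() + "://" + e.String()
}

func main() {
	a, b := udpEP("10.0.0.1", 4000), udpEP("10.0.0.1", 4000)
	fmt.Println("interface equality:", a == b)
	fmt.Println("key equality:", endpointKey(a) == endpointKey(b))

	routes := map[string]uint32{endpointKey(a): 203986}
	fmt.Println("lookup via b:", routes[endpointKey(b)])
	fmt.Println("wss key:", endpointKey(wssEndpoint{addr: "beacon.pilotprotocol.network"}))
}
```

A comparable value type (for example one built on `netip.AddrPort`) would be an alternative that avoids string keys.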
- On `Open()`: dial WSS to the configured beacon URL using the embedded root for TLS. Perform the Ed25519 challenge. On success, spawn a goroutine that read-loops binary frames from the conn into a buffered `recvCh chan recvFrame`.
- `Send(frame, _ Endpoint)`: write one binary WS frame containing `frame`. The destination Endpoint is ignored: all writes go to the beacon, which knows from the packet header where to forward it.
- `Recv()`: blocks on `recvCh`. Returns the next frame with `src = wssEndpoint{addr: beaconAddr}` since, from the daemon's perspective, every inbound packet "came from" the beacon. Higher layers parse the Pilot header for the real source nodeID.
- On disconnect: signal the daemon's `tunnel.go` via a `recvCh` error, then trigger reconnect with exponential backoff (250ms → 30s cap).
- Idle handling: respond to server pings within 10s. The Go `gorilla/websocket` library answers pings through its ping handler (`SetPingHandler`; the default handler replies with a pong), provided the read loop keeps running.
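The reconnect schedule (250ms doubling up to a 30s cap) can be expressed as a small pure function. This is a sketch: the `backoff` name and the 0-based attempt counter are assumptions, and no jitter is applied because the text does not mention any.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 250 * time.Millisecond // first retry delay, per the text
	maxDelay  = 30 * time.Second       // cap, per the text
)

// backoff returns the delay before reconnect attempt n (0-based):
// 250ms, 500ms, 1s, ... doubling until it reaches the 30s cap.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoff(attempt))
	}
}
```

Returning as soon as the cap is hit keeps the loop bounded and avoids overflow for very large attempt counts.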
```
pilot-daemon \
  -transport=udp            # today's behavior (default for now)
  -transport=compat         # WSS-only, forces relay_only=true
  -transport=auto           # try UDP first, fall back to compat after N=30s
  -compat-beacon=wss://...  # beacon WSS URL (default: wss://beacon.pilotprotocol.network/v1)
  -tls-trust=system         # current default while beacon uses Let's Encrypt; pinned will return once prod root ships
```
Compat mode forces RelayOnly=true on the daemon's registry registration so peers route via beacon.
The L4 tunnel manager today maps nodeID → *net.UDPAddr for peer endpoints. In compat mode the daemon has no peer-specific endpoints — every peer is reached via the beacon. The simplest approach: when running with wssTransport, the tunnel manager treats every peer as "use the single transport endpoint" and skips hole-punch / endpoint-refresh logic entirely. The peer state machine collapses to: handshake → relay → done.
- As shipped: nginx terminates TLS on port 443 for the `beacon.pilotprotocol.network` server block and reverse-proxies WS upgrades to `127.0.0.1:18443`, where the rendezvous binary's `pkg/beacon/wss.Server` accepts plain WebSocket connections.
- After WS upgrade, the beacon issues the auth challenge and verifies the daemon's Ed25519 signature against `s.nodes[nodeID].PublicKey` (already in registry-shared memory if beacons co-locate with the registry; otherwise an RPC to the registry).
- On success: store mapping `nodeID → *wssPeer{conn, lastSeenNano, recvCh}` in `s.wssPeers` (alongside today's `s.peers` UDP map).
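The peer map needs a policy for reconnects and for a full table. The sketch below shows one plausible shape, with a reconnect replacing the previous connection (last writer wins) and new nodes rejected at capacity. The `peerTable` type, closing the displaced connection, and the use of `io.Closer` in place of a real WebSocket conn are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

var errAtCapacity = errors.New("wss peer table full")

// peerTable maps nodeID to its live connection. io.Closer stands in for the
// WebSocket conn so the sketch stays stdlib-only.
type peerTable struct {
	mu    sync.Mutex
	max   int
	peers map[uint32]io.Closer
}

func newPeerTable(max int) *peerTable {
	return &peerTable{max: max, peers: make(map[uint32]io.Closer)}
}

// register stores conn for nodeID. A reconnect from a known node replaces
// and closes the old conn; an unknown node is rejected once the table is full.
func (t *peerTable) register(nodeID uint32, conn io.Closer) error {
	t.mu.Lock()
	old, existed := t.peers[nodeID]
	if !existed && len(t.peers) >= t.max {
		t.mu.Unlock()
		return errAtCapacity
	}
	t.peers[nodeID] = conn
	t.mu.Unlock()
	if existed {
		old.Close() // close outside the lock; the displaced conn is now stale
	}
	return nil
}

// fakeConn records whether Close was called.
type fakeConn struct{ closed bool }

func (c *fakeConn) Close() error { c.closed = true; return nil }

func main() {
	tbl := newPeerTable(1)
	first, second := &fakeConn{}, &fakeConn{}
	fmt.Println("first connect:", tbl.register(7, first))
	fmt.Println("reconnect:", tbl.register(7, second), "old closed:", first.closed)
	fmt.Println("new node at capacity:", tbl.register(8, &fakeConn{}))
}
```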
The existing relay handler reads a MsgRelay packet, looks up destination, and writes it via UDP. New behavior:
```go
// routeRelayPacket forwards one relayed frame to dst over whichever
// transport the destination is connected on.
func (s *Server) routeRelayPacket(dst uint32, frame []byte) {
	s.mu.RLock()
	if udpAddr, ok := s.peers[dst]; ok {
		s.mu.RUnlock()
		s.udpConn.WriteToUDP(frame, udpAddr) // UDP peer: existing relay path
		return
	}
	if wp, ok := s.wssPeers[dst]; ok {
		s.mu.RUnlock()
		wp.writeBinary(frame) // serialized via wp.writeMu
		return
	}
	s.mu.RUnlock()
	s.metrics.relayUnknownDest.Inc() // destination on neither transport: drop and count
}
```

Inbound from WSS: the read goroutine on each `*wssPeer` reads binary frames and feeds them into the same packet dispatcher the UDP read loop uses (or a thin shim that wraps the frame as if it had arrived via UDP from the daemon's logical address).
- Connection cap per beacon: `MaxWSSPeers = 50000` initially (sized so the beacon stays under 8 GB RSS).
- Per-source-IP rate-limit on WSS upgrade attempts: 10/sec with a burst of 100.
- Auth challenge times out after 10 seconds — bots that don't sign get dropped.
- WSS idle (no frames + no pong) timeout: 90 seconds.
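The upgrade rate limit (10/sec with a burst of 100) is a classic token bucket. The sketch below is a stdlib-only illustration with an injectable clock; the `bucket` type and `at` helper are hypothetical, it is not safe for concurrent use as written, and a real limiter would keep one bucket per source IP.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a token bucket that refills at rate tokens/sec up to burst.
type bucket struct {
	rate, burst float64
	tokens      float64
	last        time.Time
}

// newBucket starts full, so an initial burst is allowed immediately.
func newBucket(rate, burst float64, now time.Time) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow reports whether one upgrade attempt may proceed at time now.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// at returns a fixed reference time plus sec seconds (test clock helper).
func at(sec float64) time.Time {
	return time.Unix(0, 0).Add(time.Duration(sec * float64(time.Second)))
}

func main() {
	b := newBucket(10, 100, at(0))
	allowed := 0
	for i := 0; i < 150; i++ {
		if b.allow(at(0)) {
			allowed++
		}
	}
	fmt.Println("allowed in the initial burst:", allowed)
	fmt.Println("allowed one second later:", b.allow(at(1)))
}
```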
No changes. relay_only=true already exists today for daemons that want to hide their UDP endpoint from peers. Compat daemons set it; peers see it and route via beacon as they do for symmetric-NAT peers. The mechanism is identical to the existing flow; only the beacon's outbound write path changes when the relay target is a compat peer.
Per-WSS-connection cost on the beacon:
- 1 goroutine for the read loop (~8 KB stack)
- 1 buffered recvCh (~16 KB at 16-frame buffer × 1 KB avg)
- TLS state if terminated in-process (~32 KB) — N/A in shipped config (nginx fronts TLS)
- gorilla/websocket internal buffers (~64 KB)
Estimate: ~120 KB per peer with nginx fronting (similar to the original Caddy estimate). 50k peers → ~6 GB. Current beacon VMs (pilot-rendezvous-new, 16 vCPU) have comfortable headroom.
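The arithmetic behind the estimate can be checked directly from the itemized costs. Note that the four items sum to 120 KB only when the in-process TLS term is counted; without it the listed items sum to 88 KB. The sketch computes both and does not assume which figure the estimate intends. Decimal units (1 GB = 1e6 KB) are an assumption.

```go
package main

import "fmt"

// Itemized per-connection costs in KB, taken from the list above.
const (
	goroutineKB = 8
	recvChKB    = 16
	tlsStateKB  = 32 // only when TLS is terminated in-process
	wsBuffersKB = 64
)

// perPeerKB sums the itemized costs; inProcessTLS adds the TLS-state term.
func perPeerKB(inProcessTLS bool) int {
	total := goroutineKB + recvChKB + wsBuffersKB
	if inProcessTLS {
		total += tlsStateKB
	}
	return total
}

// totalGB converts a per-peer cost into total memory for n peers,
// using decimal units (1 GB = 1e6 KB).
func totalGB(kbPerPeer, peers int) float64 {
	return float64(kbPerPeer*peers) / 1e6
}

func main() {
	const peers = 50000
	for _, tlsInProc := range []bool{true, false} {
		kb := perPeerKB(tlsInProc)
		fmt.Printf("in-process TLS=%v: %d KB/peer, %.1f GB for %d peers\n",
			tlsInProc, kb, totalGB(kb, peers), peers)
	}
}
```

Either way the total stays under the 8 GB RSS target that sized `MaxWSSPeers`.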
The current overlay is recorded at ~150M req/day, with a peak of ~5k req/sec. If 10% of that becomes compat-mode traffic (~500 req/sec at peak), each such request traverses two TCP+TLS streams (in and out) instead of UDP. Approximate beacon CPU cost: ~1 vCPU per 5k req/sec of WSS relay, scaling linearly with the compat-mode share.
Egress cost is doubled for compat traffic (beacon pays for both legs). Worth modeling against revenue assumptions before flipping -transport=auto to default.
Prometheus metrics on the beacon:
- `pilot_beacon_wss_connections_total` (counter, by outcome: auth_ok / auth_fail / tls_fail / rate_limited)
- `pilot_beacon_wss_active` (gauge)
- `pilot_beacon_wss_frames_in_total` / `_out_total` (counter)
- `pilot_beacon_relay_bridge_total{src_transport,dst_transport}` (counter; emits one of 4 label combos)
- `pilot_beacon_wss_idle_disconnects_total` (counter)
On the daemon:
- `pilot_daemon_transport` (gauge: 1=udp, 2=compat) labelled by hostname.
- `pilot_daemon_wss_reconnects_total` (counter, by reason).
Public WSS endpoint can be hit by anyone. Mitigations:
- Per-source-IP rate-limit on upgrade attempts (above).
- nginx in front allows deploying fail2ban / rate-limit modules at the edge.
- Auth challenge requires the attacker to have a valid Pilot identity already registered — bots can't open holding-pattern WSS connections cheaply.
All phases collapsed into a single deploy on 2026-05-18. Production at pilot-rendezvous-new (34.71.57.205, us-central1-a). Each step was independently reversible at the time and remains so via the snapshot artifacts.
- `cmd/pilot-ca` mints root + leaf certs (subcommands: `init-root`, `issue-beacon`, `verify`).
- `internal/transport/compat/roots/` embeds the trust anchor via `//go:embed`. Currently the dev root (`dev-2026.pem`); production root not yet minted.
- 12 unit tests cover key usage flags, file modes, validity windows, chain integrity.
- Runbook: `docs/RUNBOOK-pilot-ca.md`.
- `pkg/daemon/transport.Transport` defines `Send / Recv / LocalAddr / Close`.
- Existing UDP code (`pkg/daemon/udpio.Socket`) satisfies it implicitly, with zero behavior change to UDP daemons.
- `pkg/daemon/transport.ErrClosed` is shared by both implementations.
- `pkg/daemon/transport/wss.Transport` uses `github.com/coder/websocket` for the WS client.
- CLI: `-transport=udp|compat`, `-compat-beacon`, `-tls-trust`. All explicit opt-in; `udp` is default.
- Compat mode forces `RelayOnly=true` on registration so peers route via beacon.
- 9 unit tests including auth challenge round-trip, rejection paths, idempotent Close.
- `pkg/beacon/wss.Server`: standalone WSS listener with Ed25519 challenge.
- `pkg/beacon.Server.EnableCompatWSS()` attaches it to the production beacon.
- Tier-0 destination check in `relayWorker`: if dest is connected via WSS, `WriteFrame()` bypasses the UDP sendmmsg batch path.
- `OnFrame` feeds inbound WSS frames into the existing `handlePacket` dispatch.
- Production `cmd/rendezvous` exposes the `-wss-addr 127.0.0.1:18443` flag.
- `LookupPublicKey()` on the registry resolves nodeID → Ed25519 pubkey for WSS auth.
- 10 unit tests including last-writer-wins reconnect and capacity rejection.
- `tests/compat/` covers all 4 cells of the transport matrix via an in-process fake bridge.
- 6 integration tests (UDP↔UDP, UDP↔WSS, WSS↔UDP, WSS↔WSS, reconnect routing, unknown-dest drop).
- nginx server block at `/etc/nginx/sites-enabled/beacon.pilotprotocol.network` reverse-proxies WSS upgrades to `127.0.0.1:18443`.
- Let's Encrypt cert (`certbot --nginx`, expires 2026-08-17, auto-renewal scheduled).
- systemd unit `pilot-rendezvous.service` extended with `-wss-addr 127.0.0.1:18443`. Backup at `pilot-rendezvous.service.pre-compat`.
- DNS: `beacon.pilotprotocol.network A 34.71.57.205` (TTL 300, no proxy).
- Snapshot of previous binary at `/usr/local/bin/pilot-rendezvous.pre-compat-20260518-185845`.
- A local compat-only daemon (no UDP socket) connected via `wss://beacon.pilotprotocol.network/v1/compat`, registered as nodeID 203986, completed a beacon-relayed handshake with `list-agents` (`0:0000.0002.BBE4`), established an encrypted tunnel via the WSS↔UDP bridge, and received the 1709-byte JSON directory.
- Production Pilot CA root. Mint via `pilot-ca init-root` (operator action, Yubikey/offline), embed in `internal/transport/compat/roots/` alongside the dev root for one overlap release, then delete the dev root.
- Flip daemon default `-tls-trust` back to `pinned` in the same release the production root ships.
- Prometheus scrape for WSS metrics: `pkg/beacon.Server.WSSMetrics()` already exposes `UpgradeOK/Fail / AuthOK/Fail / FramesIn/Out / IdleDisconns / ActivePeers`. Need a `/metrics` handler.
- `-transport=auto` mode (try UDP for N seconds, fall back to compat). Currently compat is explicit opt-in.
- Multi-beacon WSS connections for redundancy. Today: one WSS conn per daemon; on disconnect, reconnect with exponential backoff to the same URL.
- Caddy vs in-process TLS? Resolved: nginx. Production already runs nginx with certbot; adding a third server block reused the existing TLS automation.
- Default `-transport=auto` cutover? Deferred: auto-fallback is a future-release item. Today: explicit opt-in.
- Mark compat-mode peers in `list-agents` output? Resolved: silent. The risk of leaking deployment posture outweighs the marginal UX benefit.
- WS subprotocol negotiation. Shipped: `Sec-WebSocket-Protocol: pilot.v1` set on both sides.
- Multi-beacon WSS connections. Deferred: single conn per daemon for v1; multi-beacon redundancy is a future improvement.
- Centralized HTTPS REST gateway. That would be a separate service (Phase 8+) proxying HTTPS REST → Pilot WSS. Easy to add later once compat mode is solid; does not belong in v1 of compat mode itself.
- HTTP/3 / QUIC. The whole point of compat mode is to use TCP/443 which firewalls don't block. QUIC is UDP and would defeat the purpose.
- WebRTC. Considered — the data-channel ICE machinery would provide NAT traversal for free — but WebRTC requires a signaling server and still uses UDP for the data plane. Does not address the underlying constraint.
- `web/src/pages/docs/networks.astro` + `plain/` mirror: describe transport modes.
- New `web/src/pages/docs/firewalls.astro`: "running pilot behind a firewall" with the `-tls-trust=system` escape hatch.
- `cmd/daemon/main.go` flag documentation in `configuration.astro`.
- `README.md`: one paragraph in the architecture section.