This document has two halves: an operations section covering credentials we touch daily (SSH, API tokens, GitHub), and a strategy section on hardening the mesh protocol itself. Read the ops section before deploying anything; the strategy section is the long-term roadmap.
Cells are deployed onto Hetzner servers and managed by SSH. Follow these conventions for every key the operator handles:
- Live at `~/.ssh/` with `0600` on the private key, `0644` on the `.pub`, and `0700` on the directory. The Hetzner SSH key registered as `leif-dev` corresponds to `~/.ssh/id_ed25519.pub`.
- Never commit a private key. Not in `openjaws`, not in any cell repo, not in `.env`. There is no scenario where a private key belongs in git.
- Never paste a private key into chat, issues, or PRs. Pubkey only.
- Add new servers to known_hosts on first connect with `ssh -o StrictHostKeyChecking=accept-new …` rather than disabling the check. This pins the host key so future MITMs are caught.
- Use a single named key per role, not one global key. Examples: `~/.ssh/id_ed25519_hetzner_lighthouse`, `~/.ssh/id_ed25519_hetzner_compute`. Reference them in `~/.ssh/config`:

      Host lighthouse-*
        User root
        IdentityFile ~/.ssh/id_ed25519_hetzner_lighthouse
        IdentitiesOnly yes

- Rotate on compromise. Generate a new key, register the new pubkey on Hetzner under a fresh name, deploy it to running servers (via cloud-init for new ones; via `ssh-copy-id` then remove the old line from `~/.ssh/authorized_keys` for existing), then revoke the old one.
- Per-server `authorized_keys` should be minimal. Don't pile keys "in case." Each line is a standing grant; treat it as one.
- `genesis-igniter`'s `launchLighthouse` and `createServer` MUST attach an existing Hetzner SSH key by ID/name so the operator can SSH in for incident response. Without this, recovery from a stuck cron or broken auto-updater requires destroying the server. The igniter currently does this via the Hetzner API; verify in any new deployer.
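One way to make the SSH-key rule hard to forget in a new deployer is to refuse to build a create-server request without a key. The sketch below is not the actual `genesis-igniter` code; `buildCreateServerRequest` is a hypothetical helper, and the server name, server type, and image are placeholder values. It relies only on the `ssh_keys` field of the Hetzner Cloud create-server request, which accepts key names or IDs.

```typescript
// Hypothetical helper, not the real genesis-igniter implementation.
interface CreateServerRequest {
  name: string;
  server_type: string;
  image: string;
  ssh_keys: string[]; // Hetzner SSH key names or IDs
}

function buildCreateServerRequest(
  name: string,
  serverType: string,
  image: string,
  sshKeys: string[],
): CreateServerRequest {
  // Drop blank entries so [""] or [" "] cannot slip through as "a key".
  const keys = sshKeys.map((k) => k.trim()).filter((k) => k.length > 0);
  if (keys.length === 0) {
    throw new Error(`refusing to create "${name}" without an SSH key attached`);
  }
  return { name, server_type: serverType, image, ssh_keys: keys };
}

// Body for POST https://api.hetzner.cloud/v1/servers (placeholder values).
const body = buildCreateServerRequest("lighthouse-1", "cx22", "ubuntu-24.04", ["leif-dev"]);
```

Failing at request-build time means a misconfigured deployer never produces a server that can only be recovered by destroying it.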
- Keep API tokens in `.env`, never in code or markdown. `.env` is already gitignored at repo root.
- Treat any token that ever appeared in chat, a screenshot, or a doc as compromised and rotate it.
- The Hetzner token currently used (visible in older docs) should be rotated; future docs reference it as `$HETZNER_TOKEN`, never inline.
- For automation: prefer scoped tokens (Hetzner project tokens are already scoped; GitHub PATs should be fine-grained, repo-scoped).
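A minimal sketch of the token rules in code, assuming tokens arrive through environment variables. `requireToken` and `redact` are hypothetical helper names. The environment is passed in as a plain object so the functions stay testable; in a cell you would pass `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Fail fast when a token is missing instead of falling back to an inline default.
function requireToken(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`${name} is not set; add it to .env (never inline it)`);
  }
  return value.trim();
}

// Safe form for logs: never print the full token.
function redact(token: string): string {
  return token.length <= 8 ? "****" : `****${token.slice(-4)}`;
}
```

Logging only `redact(token)` keeps a token from leaking into a screenshot or pasted log, which under the rule above would force a rotation.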
Cells clone from public GitHub URLs over HTTPS. No credential is needed to clone or auto-update public repos. If a cell ever needs to push from a server (it shouldn't; servers are pull-only consumers), use a deploy key with write access disabled by default, and enable write only for the specific push operation.
Standalone cells commit their bun.lock so the auto-updater installs a deterministic protocol version (see commit c91c7ad on openjaws-lighthouse). Rolling forward a protocol change requires explicit lockfile commits, which gives operators a human-reviewable diff before code lands in production.
Looking at this codebase, you've built a solid foundation—a distributed "cell mesh" with P2P routing, cryptographic identity (Ed25519), narrative tracing, and some deployment automation. But "military-grade robustness" implies a massive leap in resilience, security, observability, and operational discipline.
Here is the critical path to hardening this into something that could survive contested, degraded, and operationally-limited environments:
The Problem: Your mesh relies on a single registry directory (~/.rheo/registry), a single seed node, and file-system state that can be corrupted or locked.
The Fix:
- Distributed Consensus for State: Replace the JSON file registry with a lightweight consensus layer. Integrate Raft for leader election and replicated state, and SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) for membership and failure detection. Don't let one dead node poison the mesh's view of the world.
- Shard the Registry: Use consistent hashing (e.g., jump consistent hash) to distribute cell metadata across the mesh. If a node dies, its registry shard is reconstructed from neighbors, not lost.
- Eliminate the "Seed" Crutch: Implement a gossip protocol with random peer selection and anti-entropy. Nodes should be able to join by contacting any online peer, not just a hardcoded seed.
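The sharding idea can be sketched with rendezvous (highest-random-weight) hashing. This is a deliberate swap for the jump consistent hash mentioned above: jump hash addresses numbered buckets and only handles removal of the last bucket, while a mesh loses arbitrary nodes. The node IDs, the key format, and the use of 32-bit FNV-1a are assumptions for illustration, not part of the current registry.

```typescript
// 32-bit FNV-1a over UTF-16 code units; adequate for a sketch, not a vetted choice.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Rendezvous hashing: every node scores the key, the highest score owns it.
// Ties are broken by node ID so the result does not depend on list order.
function ownerOf(key: string, nodes: string[]): string {
  if (nodes.length === 0) throw new Error("no nodes available");
  let best = nodes[0];
  let bestScore = -1;
  for (const node of nodes) {
    const score = fnv1a(`${node}|${key}`);
    if (score > bestScore || (score === bestScore && node < best)) {
      best = node;
      bestScore = score;
    }
  }
  return best;
}
```

When a node dies, only the keys it owned move to a new owner; every other key keeps its placement, which is what lets neighbors rebuild just the lost shard.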
The Problem: HTTP/1.1 fetch with JSON payloads is slow, verbose, and lacks QoS. In a military context, you need to assume the network is being jammed, intercepted, or partitioned.
The Fix:
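- Binary Protocol: Move from JSON to a binary serialization (FlatBuffers, Cap'n Proto, or Protobuf). FlatBuffers and Cap'n Proto give zero-copy reads; Protobuf needs a decode step but has the most mature tooling. All three define rules for backward-compatible schema evolution.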
- mTLS Everywhere: Every cell-to-cell RPC must be over mutually authenticated TLS. Use SPIFFE/SPIRE for dynamic identity provisioning if you want true military-grade identity management.
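- Resilient Transports: Implement QUIC as the base transport. It runs over UDP, supports connection migration (connections survive IP changes and NAT rebinding), and has built-in multiplexing. It does not punch through NATs by itself, so peers behind NAT still need a relay or hole-punching step. Crucial for mobile/disconnected nodes.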
- Mesh Overlay Network: Integrate a layer like Nebula, Tailscale, or a custom WireGuard mesh to create an encrypted overlay that survives underlay failures.
The Problem: Your current error handling is good for debugging, but in a fight, you need the system to degrade gracefully, not just log errors.
The Fix:
- Byzantine Fault Tolerance (BFT): If a node is compromised, it shouldn't be able to inject false atlas entries. Use the Ed25519 keys to sign all atlas updates and verify them before merging. Reject unsigned or invalidly signed gossip.
- Circuit Breakers & Bulkheads: You have the start of this with `failedAddresses`, but it needs to be automatic. If a cell fails 3 health checks, isolate it. Stop routing to it. Don't retry indefinitely.
- Chaos Engineering: Implement a "Chaos Monkey" cell that randomly kills, partitions, or slows other cells. If your mesh can't survive intentional chaos, it won't survive the enemy.
- Graceful Degradation: Define capability tiers. If `ai/generate` is down, can the mesh fall back to `ai/cached-response` or `ai/local-model`? Build fallback chains into the capability routing.
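The circuit-breaker and fallback-chain items above can be sketched together. This is an illustration under assumptions, not the existing `failedAddresses` logic: `CircuitBreaker` and `pickCapability` are hypothetical names, the 30-second cool-down is arbitrary, and breakers are keyed by capability name for brevity where a real router would more likely key them by cell address.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3, // consecutive failed health checks before isolating
    private readonly cooldownMs = 30000, // how long to stop routing before one probe
  ) {}

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe after the cool-down re-opens the breaker immediately.
    if (this.failures >= this.threshold) this.openedAt = now;
  }

  // Routable unless the breaker is open and the cool-down has not elapsed.
  isRoutable(now: number): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.cooldownMs;
  }
}

// Walk the fallback chain and return the first capability that is not isolated.
function pickCapability(
  chain: string[],
  breakers: Record<string, CircuitBreaker | undefined>,
  now: number,
): string | null {
  for (const capability of chain) {
    const breaker = breakers[capability];
    if (breaker === undefined || breaker.isRoutable(now)) return capability;
  }
  return null;
}
```

Passing `now` explicitly instead of calling `Date.now()` inside keeps the breaker deterministic, which matters once a chaos cell starts injecting failures on purpose.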
The Problem: JWT secrets and API tokens are in .env files and plaintext strings. In a military mesh, a single compromised node is a given, not a possibility.
The Fix:
- Zero-Trust Architecture: No cell trusts another by default. Every capability call must be authorized against a policy engine (e.g., Open Policy Agent or a custom capability ACL).
- Secret Lifecycle Management: Integrate HashiCorp Vault or a custom secret cell that uses Shamir's Secret Sharing. No static tokens in code. Ever.
- Capability Attenuation: When a cell delegates a task, it should issue a derived capability token that is valid only for that specific task and time window. This limits blast radius.
- Audit & Non-Repudiation: Your `NarrativeLedger` is excellent. Extend it to be a Merkle DAG or append-only log (like a mini-blockchain). Every action is cryptographically signed and time-stamped. If a node lies, the ledger proves it.
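The append-only log idea can be illustrated with a hash chain, the simplest form of the Merkle structure suggested above. The `LedgerEntry` shape below is an assumption and does not reflect the real `NarrativeLedger` schema. The sketch covers tamper evidence only; per-entry signatures would be layered on top.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape, not the actual NarrativeLedger format.
interface LedgerEntry {
  index: number;
  timestamp: number;
  actor: string; // cell that performed the action
  action: string;
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;
}

const GENESIS = "0".repeat(64);

// Hash a fixed-order array so the digest does not depend on object key order.
function hashEntry(e: Omit<LedgerEntry, "hash">): string {
  const canonical = JSON.stringify([e.index, e.timestamp, e.actor, e.action, e.prevHash]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Returns a new log; existing entries are never mutated.
function append(log: LedgerEntry[], actor: string, action: string, timestamp: number): LedgerEntry[] {
  const prevHash = log.length === 0 ? GENESIS : log[log.length - 1].hash;
  const partial = { index: log.length, timestamp, actor, action, prevHash };
  return [...log, { ...partial, hash: hashEntry(partial) }];
}

// Recompute every hash and link; any edited or reordered entry breaks the chain.
function verifyChain(log: LedgerEntry[]): boolean {
  let prevHash = GENESIS;
  for (let i = 0; i < log.length; i++) {
    const e = log[i];
    if (e.index !== i || e.prevHash !== prevHash || hashEntry(e) !== e.hash) return false;
    prevHash = e.hash;
  }
  return true;
}
```

Because each hash covers the previous one, a peer that holds only the latest hash can detect any rewrite of earlier history.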
The Problem: You have logs and a latency matrix. Military operations require a Common Operating Picture (COP).
The Fix:
- Distributed Tracing: Integrate OpenTelemetry. Every signal should carry a trace context. You need to see the full path of a request across 50 cells in real-time.
- Health & Telemetry Cell: Expand your telemetry into a full METT-TC (Mission, Enemy, Terrain and weather, Troops and support available, Time available, Civil considerations) equivalent. Track not just latency, but node resource exhaustion, network partition events, and capability saturation.
- Alerting & Auto-Remediation: If a critical cell (like `ai` or `registry`) dies, the mesh should auto-spawn a replacement on a healthy node. This is where your Hetzner launcher becomes a true orchestrator, not just a provisioner.
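To show what "every signal carries a trace context" looks like on the wire, here is a hand-rolled sketch of the W3C `traceparent` header, the format OpenTelemetry propagates by default. It does not use the OpenTelemetry SDK, handles only version `00`, and the function names are hypothetical. A real cell would use the SDK's propagator rather than this code.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars, shared by every hop of a request
  spanId: string; // 16 lowercase hex chars, unique per hop
  sampled: boolean;
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (m === null) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero IDs are invalid per the Trace Context spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Each cell keeps the trace ID and substitutes its own span ID before forwarding.
function childContext(parent: TraceContext, newSpanId: string): TraceContext {
  if (!/^[0-9a-f]{16}$/.test(newSpanId) || /^0+$/.test(newSpanId)) {
    throw new Error("invalid span id");
  }
  return { ...parent, spanId: newSpanId };
}
```

A cell that receives a signal would parse the incoming header, derive a child context with a freshly generated span ID, and attach the formatted result to every outbound call it makes on behalf of that signal.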
The Problem: bun install && bun run index.ts is fine for development. It is not fine for a battlefield.
The Fix:
- Immutable Artifacts: Build OCI (Docker) images or static binaries (compile TypeScript to a single binary via `bun build --compile` or Deno). The runtime should be hermetic.
- Signed Artifacts: Every cell binary/image must be signed. The orchestrator must verify the signature before spawning.
- Air-Gap Support: The mesh must be able to bootstrap and update from a local artifact repository. You won't have `npm` or GitHub in a denied environment.
- Hetzner is a Start, but not the End: You need multi-cloud and on-premise abstraction. The launcher should target Kubernetes, Nomad, or bare metal equally. Abstract the infrastructure provider.
If I had to pick one thing to implement next that gives the biggest robustness multiplier, it would be:
Implement a Byzantine-Fault-Tolerant Gossip Protocol with Merkle-ized Atlas State.
Why this first?
- It fixes your registry SPOF.
- It makes your mesh self-healing (nodes can join/leave without central coordination).
- It provides the cryptographic foundation for trust (signed atlas entries).
- It enables everything else (Chaos testing, auto-remediation, secure routing).
Concrete Implementation:
- Replace `~/.rheo/registry/*.json` with a gossip-based state machine.
- Use SWIM for failure detection (scales to thousands of nodes).
- Use Merkle trees to efficiently sync atlas deltas between peers.
- Sign every atlas entry with the cell's Ed25519 key.
- Add a `mesh/chaos` capability to your test suite that randomly partitions the network.
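Of the steps above, signing is the easiest to prototype. The sketch below uses Node's built-in `node:crypto` Ed25519 support, which Bun also provides. The `AtlasEntry` fields, the `version` counter used to reject replays, and the map of trusted public keys are assumptions for illustration; the real atlas format may differ. It does not cover SWIM or Merkle sync.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical entry shape; the real atlas schema may differ.
interface AtlasEntry {
  cellId: string;
  address: string;
  capabilities: string[];
  version: number; // increases on every update; older versions are rejected
}

interface SignedAtlasEntry {
  entry: AtlasEntry;
  signature: Uint8Array;
}

// Fixed-order array so signer and verifier serialize identically.
function canonicalBytes(e: AtlasEntry): Uint8Array {
  return new TextEncoder().encode(JSON.stringify([e.cellId, e.address, e.capabilities, e.version]));
}

function signEntry(entry: AtlasEntry, privateKey: KeyObject): SignedAtlasEntry {
  // Ed25519 takes no separate digest algorithm, hence the null first argument.
  return { entry, signature: sign(null, canonicalBytes(entry), privateKey) };
}

// Merge gossip only if it is signed by the key already trusted for that cell
// and is newer than what the atlas holds. Returns whether the entry was accepted.
function mergeEntry(
  atlas: Map<string, AtlasEntry>,
  trustedKeys: Map<string, KeyObject>,
  incoming: SignedAtlasEntry,
): boolean {
  const key = trustedKeys.get(incoming.entry.cellId);
  if (key === undefined) return false;
  if (!verify(null, canonicalBytes(incoming.entry), key, incoming.signature)) return false;
  const current = atlas.get(incoming.entry.cellId);
  if (current !== undefined && current.version >= incoming.entry.version) return false;
  atlas.set(incoming.entry.cellId, incoming.entry);
  return true;
}

// Usage with a throwaway key pair (a real cell would load its persistent identity).
const cellKeys = generateKeyPairSync("ed25519");
const trusted = new Map<string, KeyObject>([["cell-a", cellKeys.publicKey]]);
const atlas = new Map<string, AtlasEntry>();
const first: AtlasEntry = { cellId: "cell-a", address: "10.0.0.1:7000", capabilities: ["ai/generate"], version: 1 };
const accepted = mergeEntry(atlas, trusted, signEntry(first, cellKeys.privateKey));
```

Signatures stop a compromised node from forging entries for other cells, but not from lying about itself, so this is a building block for Byzantine fault tolerance rather than the whole of it.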
Once the mesh can survive its own cells being killed and its network being cut, then you layer on the military-specific security and observability.
Bottom line: You have a clever, organic architecture. To make it military-grade, you must shift from "trusting the network" to "assuming the network is hostile and the nodes are fragile." The protocols you need (BFT, QUIC, mTLS, Merkle DAGs) are all well-established. The hard part is integrating them without losing the lightweight, "cellular" philosophy that makes your mesh elegant.