Multi-Mac mesh networking: transparent distributed serving across Apple Silicon machines #31

@weklund-agent

Summary

Enable multiple mlx-stack instances to auto-discover each other on the local network and present a single unified API endpoint. Incoming requests are routed to the node with the best combination of model availability, memory headroom, and queue depth. This is distributed serving (routing requests across independent machines), not distributed inference (splitting one model across machines) — making it fundamentally simpler and more reliable.

Problem

Many enthusiasts in the target audience have 2-4 Mac Minis. Today, each machine runs its own mlx-stack instance independently. To use multiple machines, the agent framework must:

  1. Know about each machine's endpoint
  2. Know which models are loaded on which machine
  3. Implement its own load balancing
  4. Handle failover when a machine goes down

No agent framework does this well. Most users just pick one machine and waste the others, or manually configure different agents to point at different machines.

Why Exo Isn't the Answer

Exo (exo-explore/exo) promises distributed inference — splitting a single large model across multiple machines. But it's plagued with reliability issues:

  • Nodes disappear during model downloads (GitHub #775)
  • mDNS discovery fails, clusters can't form (#1534)
  • No fault tolerance — if any node drops, the entire inference fails (#1325)
  • Only does one thing: split a model. The much more common need is capacity scaling.

Exo's approach is fundamentally fragile because splitting a model across unreliable consumer networking is hard. A single dropped packet during a tensor-parallel forward pass means a failed request.

Proposed Solution

Architecture

                    ┌─────────────────────────┐
                    │    Coordinator Node      │
                    │    (auto-elected)        │
                    │                          │
                    │  Unified API :4000       │
                    │  ┌───────────────────┐   │
                    │  │  Request Router   │   │
                    │  │  Node Health Map  │   │
                    │  │  Model Registry   │   │
                    │  └───────────────────┘   │
                    └──────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
    ┌─────────▼──────┐  ┌─────▼──────────┐  ┌──▼──────────────┐
    │  Mac Mini #1   │  │  Mac Mini #2   │  │  Mac Mini #3    │
    │  mlx-stack     │  │  mlx-stack     │  │  mlx-stack      │
    │                │  │                │  │                 │
    │  standard:     │  │  standard:     │  │  standard:      │
    │    qwen3-32b   │  │    qwen3-32b   │  │    deepseek-72b │
    │  fast:         │  │  fast:         │  │  fast:          │
    │    qwen3-8b    │  │    qwen3-8b    │  │    qwen3-8b     │
    └────────────────┘  └────────────────┘  └─────────────────┘

Key Design Principle: Independence First

Each node is a fully functional, self-sufficient mlx-stack instance. The mesh is an overlay, not a dependency:

  • Node goes down? Requests reroute to surviving nodes. Zero interruption.
  • Network flakes? Each node continues serving locally. Mesh reconnects when network recovers.
  • Coordinator goes down? Another node auto-elects as coordinator. The mesh self-heals.
  • User wants to disconnect a node? Just shut it down. The mesh adapts.

This is the fundamental difference from Exo: there is no shared state that must be consistent across nodes. Each node is independent. The mesh is additive — it makes the fleet better but isn't required for any single node to function.

Discovery

Phase 1: mDNS (zero-config local network)

Each mlx-stack instance advertises itself via mDNS (Bonjour):

_mlx-stack._tcp.local.
  hostname: mac-mini-1.local
  port: 4000
  txt: version=0.3.0, tiers=standard,fast, models=qwen3-32b,qwen3-8b, memory_free=28GB

Nodes discover each other automatically. No configuration needed. Just mlx-stack install on each machine and they find each other.
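A discovering node would fold a peer's advertised TXT properties into a registry entry. A minimal sketch, assuming TXT properties arrive as a bytes-to-bytes mapping (as python-zeroconf delivers them); the `NodeInfo` shape and `parse_txt` helper are hypothetical, not part of any existing API:

```python
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    """One discovered mlx-stack peer (illustrative field names)."""
    hostname: str
    port: int
    version: str
    tiers: list[str] = field(default_factory=list)
    models: list[str] = field(default_factory=list)
    memory_free_gb: float = 0.0

def parse_txt(hostname: str, port: int, txt: dict[bytes, bytes]) -> NodeInfo:
    """Decode the mDNS TXT properties a peer advertises into a NodeInfo."""
    props = {k.decode(): v.decode() for k, v in txt.items()}
    return NodeInfo(
        hostname=hostname,
        port=port,
        version=props.get("version", "unknown"),
        tiers=props.get("tiers", "").split(",") if props.get("tiers") else [],
        models=props.get("models", "").split(",") if props.get("models") else [],
        # "28GB" → 28.0; strip the unit suffix before parsing
        memory_free_gb=float(props.get("memory_free", "0GB").rstrip("GB") or 0),
    )
```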

Phase 2: Tailscale/WireGuard (remote nodes)

For nodes not on the same LAN (e.g., a Mac Mini at home and one at the office):

mesh:
  discovery:
    - type: mdns         # auto-discover on local network
    - type: tailscale    # discover via Tailscale network
    - type: static       # manual node list
      nodes:
        - host: mac-mini-3.tailnet
          port: 4000

Coordinator Election

Simple leader election, announced over mDNS: longest uptime wins, with a deterministic tiebreaker so every node independently reaches the same answer:

  1. Node with the longest uptime becomes coordinator
  2. On coordinator failure, the next-longest-uptime node takes over
  3. Coordinator re-election takes <5 seconds
  4. During election, each node falls back to serving only its own models (graceful degradation)

The coordinator's responsibilities are lightweight:

  • Maintain the unified node/model registry
  • Route incoming requests to the optimal node
  • Health-check all nodes (piggyback on existing watchdog)
  • Present the unified API endpoint
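The election rule above fits in a few lines. Assuming discovery gives every node the same (hostname, uptime) view of the fleet, each node computes the same winner with no consensus protocol; the tuple shape is illustrative:

```python
def elect_coordinator(nodes: list[tuple[str, float]]) -> str:
    """Longest uptime wins; hostname breaks exact ties deterministically,
    so all nodes independently agree on the same coordinator."""
    return max(nodes, key=lambda n: (n[1], n[0]))[0]

# Normal operation: mac-mini-1 has been up the longest.
peers = [("mac-mini-1", 86_400), ("mac-mini-2", 3_600), ("mac-mini-3", 7_200)]
coordinator = elect_coordinator(peers)      # → "mac-mini-1"

# On coordinator failure, re-elect over the surviving nodes:
# the next-longest-uptime node takes over.
survivors = [p for p in peers if p[0] != coordinator]
fallback = elect_coordinator(survivors)     # → "mac-mini-3"
```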

Request Routing

The coordinator routes each request based on a scoring function:

score(node, request) = w1 * model_loaded(node, request.model)
                     + w2 * memory_headroom(node)
                     + w3 * queue_depth(node)
                     + w4 * prefix_cache_hit(node, request)  # if prefix caching is enabled

Each term is normalized to [0, 1]; queue_depth(node) here is the inverted queue length (e.g., 1 / (1 + depth)), so shorter queues contribute a higher score.
  • model_loaded: Strong preference for nodes that already have the requested model loaded (avoids cold-start)
  • memory_headroom: Prefer nodes with more free memory (better inference quality, less swap risk)
  • queue_depth: Prefer nodes with shorter request queues (lower latency)
  • prefix_cache_hit: If prefix caching (Agent-aware prefix cache sharing for KV cache reuse across requests #30) is implemented, prefer nodes that have a warm cache for this request's prefix
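A sketch of the scoring step, with every term normalized to [0, 1] so the weights are comparable. The node dict shape is hypothetical; the weight values mirror the example configuration later in this issue:

```python
WEIGHTS = {"model_loaded": 0.4, "memory_headroom": 0.2,
           "queue_depth": 0.3, "cache_affinity": 0.1}

def score(node: dict, request: dict, weights: dict = WEIGHTS) -> float:
    """Score one candidate node for one request; higher is better."""
    model_loaded = 1.0 if request["model"] in node["models"] else 0.0
    memory_headroom = node["memory_free_gb"] / node["memory_total_gb"]
    queue_term = 1.0 / (1.0 + node["queue_depth"])   # shorter queue → higher score
    cache_hit = 1.0 if request.get("prefix") in node.get("warm_prefixes", ()) else 0.0
    return (weights["model_loaded"] * model_loaded
            + weights["memory_headroom"] * memory_headroom
            + weights["queue_depth"] * queue_term
            + weights["cache_affinity"] * cache_hit)

def route(nodes: list[dict], request: dict) -> dict:
    """Pick the best-scoring node for the request."""
    return max(nodes, key=lambda n: score(n, request))
```

With the fleet from the diagram, a qwen3-32b request lands on mac-mini-2: both candidates have the model loaded, so the shorter queue wins.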

Unified API

The coordinator exposes the same OpenAI-compatible API at its :4000 endpoint:

# Agent framework points at the coordinator
export OPENAI_API_BASE=http://coordinator.local:4000/v1

# Requests are transparently routed to the best node
curl http://coordinator.local:4000/v1/chat/completions \
  -d '{"model": "standard", "messages": [...]}'
# → routed to mac-mini-2 (shortest queue, model already loaded)

The agent framework doesn't know about individual nodes. It talks to one endpoint.

Model Registry

The coordinator maintains a unified view of all available models:

mlx-stack mesh status
Mesh Status: 3 nodes, all healthy

Node             Chip           Memory     Models                  Queue
mac-mini-1       M4 Pro 64GB    28GB free  standard:qwen3-32b      2
                                           fast:qwen3-8b           0
mac-mini-2       M4 Pro 48GB    20GB free  standard:qwen3-32b      0
                                           fast:qwen3-8b           1
mac-mini-3       M4 Max 128GB   80GB free  standard:deepseek-72b   1
                                           fast:qwen3-8b           0

Tier Resolution:
  "standard" → [mac-mini-1, mac-mini-2, mac-mini-3]  (3 nodes)
  "fast"     → [mac-mini-1, mac-mini-2, mac-mini-3]  (3 nodes)
  Throughput: ~3x single node for concurrent requests

Heterogeneous Fleet

Nodes don't need identical configurations. A fleet might be:

  • 2x Mac Mini M4 Pro 24GB running 8B models (fast tier workhorse)
  • 1x Mac Mini M4 Pro 64GB running 32B model (standard tier)
  • 1x Mac Studio M4 Ultra 192GB running 70B model (premium local tier)

The mesh handles this naturally: requests for "standard" route to the node(s) that have a standard-tier model loaded, weighted by queue depth.
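Tier resolution is then just a filter over the registry before scoring kicks in. A sketch, again with a hypothetical node shape, mirroring the heterogeneous fleet above:

```python
def resolve_tier(nodes: list[dict], tier: str) -> list[dict]:
    """Return the healthy nodes advertising a model for the given tier."""
    return [n for n in nodes if n["healthy"] and tier in n["tiers"]]

fleet = [
    {"hostname": "mac-mini-1", "tiers": {"fast"},     "healthy": True},
    {"hostname": "mac-mini-2", "tiers": {"fast"},     "healthy": True},
    {"hostname": "mac-mini-3", "tiers": {"standard"}, "healthy": True},
    {"hostname": "mac-studio", "tiers": {"premium"},  "healthy": True},
]
[n["hostname"] for n in resolve_tier(fleet, "standard")]   # → ["mac-mini-3"]
```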

Scaling Model

Machines   Benefit
1          No change — mlx-stack works as before
2          2x concurrent throughput, failover
3-4        3-4x concurrent throughput; one machine can go down with no impact
5+         Diminishing returns on throughput, but useful for running diverse model portfolios

For a user running OpenHands with 10 concurrent coding agents, going from 1 Mac Mini to 3 triples their throughput with zero configuration beyond mlx-stack install on each machine.

NOT Distributed Inference (Important Distinction)

This proposal is explicitly NOT about splitting a single large model across machines (tensor parallelism). That is:

  • Extremely sensitive to network latency (every forward pass requires synchronization)
  • Fragile (any node failure kills the inference)
  • Complex (tensor sharding, pipeline parallelism, gradient synchronization)
  • Already attempted by Exo with poor reliability results on consumer hardware

Distributed serving is fundamentally simpler because each node is self-contained. The mesh just routes requests. If you want to run a 70B model and have one machine with enough memory, great — run it on that machine. If no single machine has enough memory, that's a hardware constraint, not a software one.

Exception: For Thunderbolt-connected machines (direct, low-latency link), we could optionally integrate with MLX Distributed for tensor-parallel inference. But this would be an advanced opt-in feature, not the default.

CLI Commands

# Enable mesh on this node
mlx-stack mesh enable

# Check mesh status
mlx-stack mesh status

# List all nodes
mlx-stack mesh nodes

# Manually add a remote node
mlx-stack mesh add mac-mini-3.tailnet:4000

# Remove a node
mlx-stack mesh remove mac-mini-3

# Disable mesh (revert to standalone)
mlx-stack mesh disable

# View request routing history
mlx-stack mesh routes --last 100

Configuration

mesh:
  enabled: true
  role: auto               # auto, coordinator, or worker
  discovery:
    - type: mdns
    - type: tailscale
  routing:
    strategy: scored        # scored, round_robin, random
    weights:
      model_loaded: 0.4
      memory_headroom: 0.2
      queue_depth: 0.3
      cache_affinity: 0.1
  health_check:
    interval: 10            # seconds
    timeout: 5
    unhealthy_threshold: 3  # consecutive failures before marking node down
  coordinator:
    port: 4000
    failover_timeout: 5     # seconds before re-election

Implementation Phases

Phase 1: Discovery + Status (2-3 weeks)

  • mDNS advertisement and discovery
  • Node registry with health checking
  • mlx-stack mesh status command
  • No request routing yet — just awareness

Phase 2: Request Routing (2-3 weeks)

  • Coordinator election
  • Unified API endpoint on coordinator
  • Scored request routing
  • Failover on node failure

Phase 3: Advanced (2-3 weeks)

Priority

v0.4 — High wow factor but smaller addressable audience than the v0.2/v0.3 features. Most users start with a single machine. This becomes compelling once the single-machine experience is polished and users want to scale.

Acceptance Criteria

  • mDNS auto-discovery finds mlx-stack instances on the local network
  • Coordinator auto-election works and re-elects on failure within 5 seconds
  • Unified API endpoint routes requests to the optimal node
  • Node failure triggers automatic rerouting with zero dropped requests
  • mlx-stack mesh status shows all nodes, models, and queue depths
  • Heterogeneous fleets (different chips, memory, models) route correctly
  • Mesh is purely additive — disabling it reverts to standalone operation with no side effects
  • Static node configuration works for non-mDNS environments
  • Request routing metrics (which node served which request)
  • Documentation covering setup, topology, and troubleshooting
  • Integration test with 2+ simulated nodes
