Multi-Mac mesh networking: transparent distributed serving across Apple Silicon machines #31

@weklund-agent

Summary

Enable multiple mlx-stack instances to auto-discover each other on the local network and present a single unified API endpoint. Incoming requests are routed to the node with the best combination of model availability, memory headroom, and queue depth. This is distributed serving (routing requests across independent machines), not distributed inference (splitting one model across machines) — making it fundamentally simpler and more reliable.

Problem

Many enthusiasts in the target audience have 2-4 Mac Minis. Today, each machine runs its own mlx-stack instance independently. To use multiple machines, the agent framework must:

  1. Know about each machine's endpoint
  2. Know which models are loaded on which machine
  3. Implement its own load balancing
  4. Handle failover when a machine goes down

No agent framework does this well. Most users just pick one machine and waste the others, or manually configure different agents to point at different machines.

Why Exo Isn't the Answer

Exo (exo-explore/exo) promises distributed inference — splitting a single large model across multiple machines. But it's plagued with reliability issues:

  • Nodes disappear during model downloads (GitHub #775)
  • mDNS discovery fails, clusters can't form (#1534)
  • No fault tolerance — if any node drops, the entire inference fails (#1325)
  • Only does one thing: split a model. The much more common need is capacity scaling.

Exo's approach is fundamentally fragile because splitting a model across unreliable consumer networking is hard. A single dropped packet during a tensor-parallel forward pass means a failed request.

Proposed Solution

Architecture

                    ┌─────────────────────────┐
                    │    Coordinator Node      │
                    │    (auto-elected)        │
                    │                          │
                    │  Unified API :4000       │
                    │  ┌───────────────────┐   │
                    │  │  Request Router   │   │
                    │  │  Node Health Map  │   │
                    │  │  Model Registry   │   │
                    │  └───────────────────┘   │
                    └──────────┬──────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
    ┌─────────▼──────┐  ┌─────▼──────────┐  ┌──▼──────────────┐
    │  Mac Mini #1   │  │  Mac Mini #2   │  │  Mac Mini #3    │
    │  mlx-stack     │  │  mlx-stack     │  │  mlx-stack      │
    │                │  │                │  │                 │
    │  standard:     │  │  standard:     │  │  standard:      │
    │    qwen3-32b   │  │    qwen3-32b   │  │    deepseek-72b │
    │  fast:         │  │  fast:         │  │  fast:          │
    │    qwen3-8b    │  │    qwen3-8b    │  │    qwen3-8b     │
    └────────────────┘  └────────────────┘  └─────────────────┘

Key Design Principle: Independence First

Each node is a fully functional, self-sufficient mlx-stack instance. The mesh is an overlay, not a dependency:

  • Node goes down? Requests reroute to surviving nodes. Zero interruption.
  • Network flakes? Each node continues serving locally. Mesh reconnects when network recovers.
  • Coordinator goes down? Another node auto-elects as coordinator. The mesh self-heals.
  • User wants to disconnect a node? Just shut it down. The mesh adapts.

This is the fundamental difference from Exo: there is no shared state that must be consistent across nodes. Each node is independent. The mesh is additive — it makes the fleet better but isn't required for any single node to function.

Discovery

Phase 1: mDNS (zero-config local network)

Each mlx-stack instance advertises itself via mDNS (Bonjour):

_mlx-stack._tcp.local.
  hostname: mac-mini-1.local
  port: 4000
  txt: version=0.3.0, tiers=standard,fast, models=qwen3-32b,qwen3-8b, memory_free=28GB

Nodes discover each other automatically. No configuration needed. Just mlx-stack install on each machine and they find each other.
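A discovering node would fold a peer's advertised TXT properties into a registry entry. A minimal sketch, assuming TXT properties arrive as a bytes-to-bytes mapping (as python-zeroconf delivers them); the `NodeInfo` shape and `parse_txt` helper are hypothetical, not part of any existing API:

```python
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    """One discovered mlx-stack peer (illustrative field names)."""
    hostname: str
    port: int
    version: str
    tiers: list[str] = field(default_factory=list)
    models: list[str] = field(default_factory=list)
    memory_free_gb: float = 0.0

def parse_txt(hostname: str, port: int, txt: dict[bytes, bytes]) -> NodeInfo:
    """Decode the mDNS TXT properties a peer advertises into a NodeInfo."""
    props = {k.decode(): v.decode() for k, v in txt.items()}
    return NodeInfo(
        hostname=hostname,
        port=port,
        version=props.get("version", "unknown"),
        tiers=props.get("tiers", "").split(",") if props.get("tiers") else [],
        models=props.get("models", "").split(",") if props.get("models") else [],
        # "28GB" → 28.0; strip the unit suffix before parsing
        memory_free_gb=float(props.get("memory_free", "0GB").rstrip("GB") or 0),
    )
```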

Phase 2: Tailscale/WireGuard (remote nodes)

For nodes not on the same LAN (e.g., a Mac Mini at home and one at the office):

mesh:
  discovery:
    - type: mdns         # auto-discover on local network
    - type: tailscale    # discover via Tailscale network
    - type: static       # manual node list
      nodes:
        - host: mac-mini-3.tailnet
          port: 4000

Coordinator Election

Simple leader election, announced over mDNS: longest uptime wins, with a deterministic tiebreaker so every node independently reaches the same answer:

  1. Node with the longest uptime becomes coordinator
  2. On coordinator failure, the next-longest-uptime node takes over
  3. Coordinator re-election takes <5 seconds
  4. During election, each node falls back to serving only its own models (graceful degradation)

The coordinator's responsibilities are lightweight:

  • Maintain the unified node/model registry
  • Route incoming requests to the optimal node
  • Health-check all nodes (piggyback on existing watchdog)
  • Present the unified API endpoint
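The election rule above fits in a few lines. Assuming discovery gives every node the same (hostname, uptime) view of the fleet, each node computes the same winner with no consensus protocol; the tuple shape is illustrative:

```python
def elect_coordinator(nodes: list[tuple[str, float]]) -> str:
    """Longest uptime wins; hostname breaks exact ties deterministically,
    so all nodes independently agree on the same coordinator."""
    return max(nodes, key=lambda n: (n[1], n[0]))[0]

# Normal operation: mac-mini-1 has been up the longest.
peers = [("mac-mini-1", 86_400), ("mac-mini-2", 3_600), ("mac-mini-3", 7_200)]
coordinator = elect_coordinator(peers)      # → "mac-mini-1"

# On coordinator failure, re-elect over the surviving nodes:
# the next-longest-uptime node takes over.
survivors = [p for p in peers if p[0] != coordinator]
fallback = elect_coordinator(survivors)     # → "mac-mini-3"
```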

Request Routing

The coordinator routes each request based on a scoring function:

score(node, request) = w1 * model_loaded(node, request.model)
                     + w2 * memory_headroom(node)
                     + w3 * queue_depth(node)
                     + w4 * prefix_cache_hit(node, request)  # if prefix caching is enabled

Each term is normalized to [0, 1]; queue_depth(node) here is the inverted queue length (e.g., 1 / (1 + depth)), so shorter queues contribute a higher score.
  • model_loaded: Strong preference for nodes that already have the requested model loaded (avoids cold-start)
  • memory_headroom: Prefer nodes with more free memory (better inference quality, less swap risk)
  • queue_depth: Prefer nodes with shorter request queues (lower latency)
  • prefix_cache_hit: If prefix caching (Agent-aware prefix cache sharing for KV cache reuse across requests #30) is implemented, prefer nodes that have a warm cache for this request's prefix
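A sketch of the scoring step, with every term normalized to [0, 1] so the weights are comparable. The node dict shape is hypothetical; the weight values mirror the example configuration later in this issue:

```python
WEIGHTS = {"model_loaded": 0.4, "memory_headroom": 0.2,
           "queue_depth": 0.3, "cache_affinity": 0.1}

def score(node: dict, request: dict, weights: dict = WEIGHTS) -> float:
    """Score one candidate node for one request; higher is better."""
    model_loaded = 1.0 if request["model"] in node["models"] else 0.0
    memory_headroom = node["memory_free_gb"] / node["memory_total_gb"]
    queue_term = 1.0 / (1.0 + node["queue_depth"])   # shorter queue → higher score
    cache_hit = 1.0 if request.get("prefix") in node.get("warm_prefixes", ()) else 0.0
    return (weights["model_loaded"] * model_loaded
            + weights["memory_headroom"] * memory_headroom
            + weights["queue_depth"] * queue_term
            + weights["cache_affinity"] * cache_hit)

def route(nodes: list[dict], request: dict) -> dict:
    """Pick the best-scoring node for the request."""
    return max(nodes, key=lambda n: score(n, request))
```

With the fleet from the diagram, a qwen3-32b request lands on mac-mini-2: both candidates have the model loaded, so the shorter queue wins.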

Unified API

The coordinator exposes the same OpenAI-compatible API at its :4000 endpoint:

# Agent framework points at the coordinator
export OPENAI_API_BASE=http://coordinator.local:4000/v1

# Requests are transparently routed to the best node
curl http://coordinator.local:4000/v1/chat/completions \
  -d '{"model": "standard", "messages": [...]}'
# → routed to mac-mini-2 (shortest queue, model already loaded)

The agent framework doesn't know about individual nodes. It talks to one endpoint.

Model Registry

The coordinator maintains a unified view of all available models:

mlx-stack mesh status
Mesh Status: 3 nodes, all healthy

Node             Chip           Memory     Models                  Queue
mac-mini-1       M4 Pro 64GB    28GB free  standard:qwen3-32b      2
                                           fast:qwen3-8b           0
mac-mini-2       M4 Pro 48GB    20GB free  standard:qwen3-32b      0
                                           fast:qwen3-8b           1
mac-mini-3       M4 Max 128GB   80GB free  standard:deepseek-72b   1
                                           fast:qwen3-8b           0

Tier Resolution:
  "standard" → [mac-mini-1, mac-mini-2, mac-mini-3]  (3 nodes)
  "fast"     → [mac-mini-1, mac-mini-2, mac-mini-3]  (3 nodes)
  Throughput: ~3x single node for concurrent requests

Heterogeneous Fleet

Nodes don't need identical configurations. A fleet might be:

  • 2x Mac Mini M4 Pro 24GB running 8B models (fast tier workhorse)
  • 1x Mac Mini M4 Pro 64GB running 32B model (standard tier)
  • 1x Mac Studio M4 Ultra 192GB running 70B model (premium local tier)

The mesh handles this naturally: requests for "standard" route to the node(s) that have a standard-tier model loaded, weighted by queue depth.
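Tier resolution is then just a filter over the registry before scoring kicks in. A sketch, again with a hypothetical node shape, mirroring the heterogeneous fleet above:

```python
def resolve_tier(nodes: list[dict], tier: str) -> list[dict]:
    """Return the healthy nodes advertising a model for the given tier."""
    return [n for n in nodes if n["healthy"] and tier in n["tiers"]]

fleet = [
    {"hostname": "mac-mini-1", "tiers": {"fast"},     "healthy": True},
    {"hostname": "mac-mini-2", "tiers": {"fast"},     "healthy": True},
    {"hostname": "mac-mini-3", "tiers": {"standard"}, "healthy": True},
    {"hostname": "mac-studio", "tiers": {"premium"},  "healthy": True},
]
[n["hostname"] for n in resolve_tier(fleet, "standard")]   # → ["mac-mini-3"]
```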

Scaling Model

Machines   Benefit
1          No change — mlx-stack works as before
2          2x concurrent throughput, failover
3-4        3-4x concurrent throughput; one machine can go down with no impact
5+         Diminishing returns on throughput, but useful for running diverse model portfolios

For a user running OpenHands with 10 concurrent coding agents, going from 1 Mac Mini to 3 triples their throughput with zero configuration beyond mlx-stack install on each machine.

NOT Distributed Inference (Important Distinction)

This proposal is explicitly NOT about splitting a single large model across machines (tensor parallelism). That is:

  • Extremely sensitive to network latency (every forward pass requires synchronization)
  • Fragile (any node failure kills the inference)
  • Complex (tensor sharding, pipeline parallelism, gradient synchronization)
  • Already attempted by Exo with poor reliability results on consumer hardware

Distributed serving is fundamentally simpler because each node is self-contained. The mesh just routes requests. If you want to run a 70B model and have one machine with enough memory, great — run it on that machine. If no single machine has enough memory, that's a hardware constraint, not a software one.

Exception: For Thunderbolt-connected machines (direct, low-latency link), we could optionally integrate with MLX Distributed for tensor-parallel inference. But this would be an advanced opt-in feature, not the default.

CLI Commands

# Enable mesh on this node
mlx-stack mesh enable

# Check mesh status
mlx-stack mesh status

# List all nodes
mlx-stack mesh nodes

# Manually add a remote node
mlx-stack mesh add mac-mini-3.tailnet:4000

# Remove a node
mlx-stack mesh remove mac-mini-3

# Disable mesh (revert to standalone)
mlx-stack mesh disable

# View request routing history
mlx-stack mesh routes --last 100

Configuration

mesh:
  enabled: true
  role: auto               # auto, coordinator, or worker
  discovery:
    - type: mdns
    - type: tailscale
  routing:
    strategy: scored        # scored, round_robin, random
    weights:
      model_loaded: 0.4
      memory_headroom: 0.2
      queue_depth: 0.3
      cache_affinity: 0.1
  health_check:
    interval: 10            # seconds
    timeout: 5
    unhealthy_threshold: 3  # consecutive failures before marking node down
  coordinator:
    port: 4000
    failover_timeout: 5     # seconds before re-election

Implementation Phases

Phase 1: Discovery + Status (2-3 weeks)

  • mDNS advertisement and discovery
  • Node registry with health checking
  • mlx-stack mesh status command
  • No request routing yet — just awareness

Phase 2: Request Routing (2-3 weeks)

  • Coordinator election
  • Unified API endpoint on coordinator
  • Scored request routing
  • Failover on node failure

Phase 3: Advanced (2-3 weeks)

Priority

v0.4 — High wow factor but smaller addressable audience than the v0.2/v0.3 features. Most users start with a single machine. This becomes compelling once the single-machine experience is polished and users want to scale.

Acceptance Criteria

  • mDNS auto-discovery finds mlx-stack instances on the local network
  • Coordinator auto-election works and re-elects on failure within 5 seconds
  • Unified API endpoint routes requests to the optimal node
  • Node failure triggers automatic rerouting with zero dropped requests
  • mlx-stack mesh status shows all nodes, models, and queue depths
  • Heterogeneous fleets (different chips, memory, models) route correctly
  • Mesh is purely additive — disabling it reverts to standalone operation with no side effects
  • Static node configuration works for non-mDNS environments
  • Request routing metrics (which node served which request)
  • Documentation covering setup, topology, and troubleshooting
  • Integration test with 2+ simulated nodes
