Summary

Enable multiple mlx-stack instances to auto-discover each other on the local network and present a single unified API endpoint. Incoming requests are routed to the node with the best combination of model availability, memory headroom, and queue depth. This is distributed serving (routing requests across independent machines), not distributed inference (splitting one model across machines) — making it fundamentally simpler and more reliable.
Problem
Many enthusiasts in the target audience have 2-4 Mac Minis. Today, each machine runs its own mlx-stack instance independently. To use multiple machines, the agent framework must:

- Know about each machine's endpoint
- Know which models are loaded on which machine
- Implement its own load balancing
- Handle failover when a machine goes down
No agent framework does this well. Most users just pick one machine and waste the others, or manually configure different agents to point at different machines.
Why Exo Isn't the Answer
Exo (exo-explore/exo) promises distributed inference — splitting a single large model across multiple machines. But it's plagued with reliability issues:
- Nodes disappear during model downloads (GitHub #775)
- mDNS discovery fails, clusters can't form (#1534)
- No fault tolerance — if any node drops, the entire inference fails (#1325)
- Only does one thing: split a model. The much more common need is capacity scaling.
Exo's approach is fundamentally fragile because splitting a model across unreliable consumer networking is hard. A single dropped packet during a tensor-parallel forward pass means a failed request.
Proposed Solution

Key Design Principle: Independence First

Each node is a fully functional, self-sufficient mlx-stack instance. The mesh is an overlay, not a dependency:
- Node goes down? Requests reroute to surviving nodes. Zero interruption.
- Network flakes? Each node continues serving locally. Mesh reconnects when network recovers.
- Coordinator goes down? Another node auto-elects as coordinator. The mesh self-heals.
- User wants to disconnect a node? Just shut it down. The mesh adapts.
This is the fundamental difference from Exo: there is no shared state that must be consistent across nodes. Each node is independent. The mesh is additive — it makes the fleet better but isn't required for any single node to function.
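One way to implement the auto-election without shared state: every node applies the same deterministic ordering over facts each node already advertises, such as a static priority with uptime as a tiebreaker. A minimal sketch of that idea — field names and the exact ordering are illustrative, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    priority: int  # advertised by each node (higher wins)
    uptime_s: int  # tiebreaker: longest-running node wins

def elect_coordinator(nodes: list[Node]) -> Node:
    # Deterministic: every node applies the same ordering to the same
    # advertised data, so all nodes agree without extra coordination.
    # The name is a final tiebreaker to make the result unambiguous.
    return max(nodes, key=lambda n: (n.priority, n.uptime_s, n.name))

nodes = [
    Node("mac-mini-1", priority=10, uptime_s=86400),
    Node("mac-mini-2", priority=10, uptime_s=3600),
    Node("mac-studio", priority=20, uptime_s=600),
]
print(elect_coordinator(nodes).name)  # → mac-studio (highest priority)
```

Because the ordering is a pure function of advertised data, a surviving node set re-runs the same comparison after a failure and converges on a new coordinator with no negotiation protocol.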
Discovery
Phase 1: mDNS (zero-config local network)
Each mlx-stack instance advertises itself via mDNS (Bonjour). Nodes discover each other automatically, with no configuration needed: just mlx-stack install on each machine and they find each other.

Phase 2: Tailscale/WireGuard (remote nodes)

For nodes not on the same LAN (e.g., a Mac Mini at home and one at the office), remote nodes can be added to the mesh manually over a Tailscale or WireGuard link.

Coordinator Election

Simple leader election via mDNS priority + uptime tiebreaker. The coordinator's responsibilities are lightweight.

Request Routing

The coordinator routes each request based on a scoring function.
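Routing, as described in the summary, weighs model availability, memory headroom, and queue depth. A hypothetical sketch of such a scoring function — the weights and field names are illustrative, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    name: str
    loaded_models: set[str]
    mem_free_gb: float
    queue_depth: int

def score(node: NodeState, model: str) -> float:
    # A node that would have to load the model first is heavily penalized;
    # among nodes that already have it, prefer free memory and short queues.
    has_model = 10.0 if model in node.loaded_models else 0.0
    return has_model + node.mem_free_gb / 10.0 - node.queue_depth

def route(fleet: list[NodeState], model: str) -> NodeState:
    return max(fleet, key=lambda n: score(n, model))

fleet = [
    NodeState("mac-mini-1", {"fast"}, mem_free_gb=8, queue_depth=3),
    NodeState("mac-mini-2", {"fast", "standard"}, mem_free_gb=20, queue_depth=0),
]
print(route(fleet, "standard").name)  # → mac-mini-2
```

The coordinator only needs each node's periodically refreshed state to make this decision; no node-to-node consensus is involved.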
Unified API

The coordinator exposes the same OpenAI-compatible API at its :4000 endpoint:
```bash
# Agent framework points at the coordinator
export OPENAI_API_BASE=http://coordinator.local:4000/v1

# Requests are transparently routed to the best node
curl http://coordinator.local:4000/v1/chat/completions \
  -d '{"model": "standard", "messages": [...]}'
# → routed to mac-mini-2 (shortest queue, model already loaded)
```
The agent framework doesn't know about individual nodes. It talks to one endpoint.
Model Registry
The coordinator maintains a unified view of all available models:
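The registry's exact shape isn't specified here; a hypothetical sketch of a per-tier view that the coordinator could aggregate from node heartbeats (node and model names are illustrative):

```python
# Hypothetical aggregated registry: tier -> nodes currently serving it.
registry: dict[str, list[dict]] = {
    "fast":     [{"node": "mac-mini-1", "model": "qwen2.5-7b", "queue": 2},
                 {"node": "mac-mini-2", "model": "qwen2.5-7b", "queue": 0}],
    "standard": [{"node": "mac-mini-3", "model": "qwen2.5-32b", "queue": 1}],
}

def nodes_for(tier: str) -> list[str]:
    """Candidate nodes for a tier, shortest queue first."""
    entries = sorted(registry.get(tier, []), key=lambda e: e["queue"])
    return [e["node"] for e in entries]

print(nodes_for("fast"))  # → ['mac-mini-2', 'mac-mini-1']
```

A tier with no serving nodes simply yields an empty candidate list, which the coordinator can surface as a routing error rather than a node-level failure.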
Heterogeneous Fleet

Nodes don't need identical configurations. A fleet might be:
- 2x Mac Mini M4 Pro 24GB running 8B models (fast tier workhorse)
- 1x Mac Mini M4 Pro 64GB running 32B model (standard tier)
- 1x Mac Studio M4 Ultra 192GB running 70B model (premium local tier)
The mesh handles this naturally: requests for "standard" route to the node(s) that have a standard-tier model loaded, weighted by queue depth.
Scaling Model
| Machines | Benefit |
| --- | --- |
| 1 | No change — mlx-stack works as before |
| 2 | 2x concurrent throughput, failover |
| 3-4 | 3-4x concurrent throughput; one machine can go down with no impact |
| 5+ | Diminishing returns on throughput, but useful for running diverse model portfolios |
For a user running OpenHands with 10 concurrent coding agents, going from 1 Mac Mini to 3 triples their throughput with zero configuration beyond mlx-stack install on each machine.
NOT Distributed Inference (Important Distinction)
This proposal is explicitly NOT about splitting a single large model across machines (tensor parallelism). That is:
- Extremely sensitive to network latency (every forward pass requires synchronization)
- Already attempted by Exo, with poor reliability results on consumer hardware
Distributed serving is fundamentally simpler because each node is self-contained. The mesh just routes requests. If you want to run a 70B model and have one machine with enough memory, great — run it on that machine. If no single machine has enough memory, that's a hardware constraint, not a software one.
Exception: For Thunderbolt-connected machines (direct, low-latency link), we could optionally integrate with MLX Distributed for tensor-parallel inference. But this would be an advanced opt-in feature, not the default.
CLI Commands
```bash
# Enable mesh on this node
mlx-stack mesh enable

# Check mesh status
mlx-stack mesh status

# List all nodes
mlx-stack mesh nodes

# Manually add a remote node
mlx-stack mesh add mac-mini-3.tailnet:4000

# Remove a node
mlx-stack mesh remove mac-mini-3

# Disable mesh (revert to standalone)
mlx-stack mesh disable

# View request routing history
mlx-stack mesh routes --last 100
```
Implementation Phases

- Phase 1: Discovery + Status (2-3 weeks), including the mlx-stack mesh status command
- Phase 2: Request Routing (2-3 weeks)
- Phase 3: Advanced (2-3 weeks)

Priority

v0.4 — High wow factor but smaller addressable audience than the v0.2/v0.3 features. Most users start with a single machine. This becomes compelling once the single-machine experience is polished and users want to scale.
Acceptance Criteria
- mDNS auto-discovery finds mlx-stack instances on the local network
- Coordinator auto-election works and re-elects on failure within 5 seconds
- Unified API endpoint routes requests to the optimal node
- Node failure triggers automatic rerouting with zero dropped requests
- mlx-stack mesh status shows all nodes, models, and queue depths