Pool the VRAM of several consumer GPUs to run an LLM that none of them could run alone.
Plexus is a distributed LLM inference mesh written in Rust. The goal is to let a handful of heterogeneous GPUs — different vendors, different VRAM sizes — act as a single, larger pool, so you can run a model that wouldn't fit on any one of them.
⚠️ Plexus is in early development. It does not run real distributed inference yet. The pieces that exist today are a single-machine walking skeleton and a CPU-only model forward pass. See Project Status before evaluating it. The roadmap below describes the intended system, most of which is not built.
A single 8 GB GPU can't load a 24 GB model. Plexus aims to make three 8 GB GPUs cooperate as one ~24 GB pool so the model can run. Memory-bound inference is split across machines (tensor / pipeline parallelism), each holding part of the weights and KV cache.
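The arithmetic behind that claim can be sketched as a simple feasibility check. This is an illustration only, not Plexus code: `Node`, `usable_pool_mb`, `fits_in_pool`, and the idea of a per-node reserved amount are all hypothetical names and assumptions. It also assumes the model can be split finely enough that per-node fragmentation does not matter.

```rust
/// Hypothetical description of one pool member; not a Plexus type.
#[derive(Debug, Clone, Copy)]
pub struct Node {
    /// Total VRAM on the device, in decimal megabytes (8 GB = 8000 MB).
    pub vram_mb: u64,
    /// VRAM assumed unavailable to the model (driver, display, runtime), in MB.
    pub reserved_mb: u64,
}

/// Sum of usable VRAM across the pool, in MB.
pub fn usable_pool_mb(nodes: &[Node]) -> u64 {
    nodes
        .iter()
        .map(|n| n.vram_mb.saturating_sub(n.reserved_mb))
        .sum()
}

/// True if the weights plus the KV cache fit in the pooled usable VRAM.
/// Assumption: the split is fine-grained enough to ignore per-node fragmentation.
pub fn fits_in_pool(nodes: &[Node], weights_mb: u64, kv_cache_mb: u64) -> bool {
    usable_pool_mb(nodes) >= weights_mb.saturating_add(kv_cache_mb)
}
```

Under these assumptions, three 8 GB nodes with nothing reserved give exactly 24000 MB, so a 24 GB set of weights fits only when no memory is needed for the KV cache. Any reserved memory or cache budget pushes the same model over the limit, which is why the pool size above is stated as approximate.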
The longer-term vision is a vendor-neutral pool that mixes NVIDIA, Apple Silicon, AMD, Intel, and CPU nodes, with optional verifiability so you can trade latency for trust on a per-request basis. None of that cross-vendor / verifiable functionality exists yet — it is the direction, not the current state.
Who it's for: homelab and small-lab operators who already own a few budget GPUs and want to run larger models than a single card allows.
Plexus is pre-alpha: roughly 5.8k lines of Rust across 10 crates, with ~2.1k lines of tests. Here is an honest breakdown of what is real versus designed:
| Area | State | Notes |
|---|---|---|
| Workspace, crates, CI, license/policy docs | ✅ Done | 10-crate Cargo workspace, lefthook gates, GitLab CI (shadow) |
| HTTP gateway with OpenAI-shaped /v1/chat/completions | ✅ Works | Request/response shape + X-Plexus-Trust header parsing |
| Worker gRPC service + ring forward | | Returns a BLAKE3 hash of the input, not a real generation |
| Single-machine "walking skeleton" on Kubernetes | ✅ Works | Verified on a local kind cluster (gateway + 3 worker pods) |
| Layer-by-layer partitioning (even split across N workers) | ✅ Works | Pure arithmetic, unit-tested |
| Llama 3.1 8B CPU forward pass | | Real attention/RoPE/RMSNorm/SwiGLU ops, deterministic, tested on a tiny config; uses a placeholder tokenizer (cl100k_base, not the real Llama vocab) |
| GPU kernels (CUDA/Metal/ROCm/SYCL) | ❌ Not started | plexus-kernel is an enum + TODO; the GPU backend stub panic!s |
| Heterogeneous multi-GPU pooling | ❌ Design only | The core value proposition — not implemented |
| Multi-node LAN cluster, public swarm, TEE attestation | ❌ Design only | Plans exist under docs/superpowers/plans/ |
| Anthropic / Ollama / MCP API surfaces | ❌ Not started | Only the OpenAI shape exists today |
In short: today Plexus can stand up a gateway and worker pods that pass tensors around and echo a placeholder, and it can run a small Llama forward pass on CPU. It cannot yet run a real model split across real GPUs. If you're looking for working distributed inference right now, see exo or petals.
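The "even split across N workers" partitioning listed in the table is pure arithmetic, and a minimal sketch of that kind of split might look like the following. The function name `partition_layers` and the choice to hand leftover layers to the earliest workers are assumptions for illustration; how plexus-graph actually handles the remainder is not stated here.

```rust
use std::ops::Range;

/// Split `num_layers` into `num_workers` contiguous ranges whose sizes
/// differ by at most one. Earlier workers absorb the remainder (an
/// assumption of this sketch). Returns an empty Vec when `num_workers` is 0.
pub fn partition_layers(num_layers: usize, num_workers: usize) -> Vec<Range<usize>> {
    if num_workers == 0 {
        return Vec::new();
    }
    let base = num_layers / num_workers;
    let extra = num_layers % num_workers;
    let mut start = 0;
    (0..num_workers)
        .map(|i| {
            // The first `extra` workers take one additional layer each.
            let len = base + usize::from(i < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

For example, `partition_layers(32, 3)` (32 being the layer count of Llama 3.1 8B) yields the ranges `0..11`, `11..22`, and `22..32`. If there are more workers than layers, the trailing workers receive empty ranges rather than causing an error.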
| Goal | Intent |
|---|---|
| Heterogeneous pooling | Treat mixed-vendor GPUs (NVIDIA / Apple / AMD / Intel / CPU) as one logical pool |
| Native Rust runtime | Implement the inference stack directly (no Candle / vLLM / llama.cpp dependency) |
| Optional verifiability | Per-request trust levels (fast / verified / attested) — design stage |
| Permissive open source | MIT licensed; no token, NFT, or paywalled "enterprise edition" |
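As a rough illustration of per-request trust levels, the sketch below parses a header value into one of the three levels named in the table. Only the header name X-Plexus-Trust and the level names fast / verified / attested come from this README. The `TrustLevel` enum, the `trust_from_header` function, case-insensitive matching, and the fallback to `fast` when the header is absent are all assumptions, not the gateway's actual behavior.

```rust
use std::str::FromStr;

/// Trust levels named in the goals table; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum TrustLevel {
    /// Assumed default when no header is sent.
    #[default]
    Fast,
    Verified,
    Attested,
}

impl FromStr for TrustLevel {
    type Err = String;

    /// Case-insensitive, whitespace-tolerant parsing (an assumption).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "fast" => Ok(Self::Fast),
            "verified" => Ok(Self::Verified),
            "attested" => Ok(Self::Attested),
            other => Err(format!("unknown trust level: {other:?}")),
        }
    }
}

/// Resolve the value of an `X-Plexus-Trust` header, if present.
/// A missing header falls back to the default level; an unknown value is an error.
pub fn trust_from_header(value: Option<&str>) -> Result<TrustLevel, String> {
    match value {
        Some(v) => v.parse(),
        None => Ok(TrustLevel::default()),
    }
}
```

Rejecting unknown values instead of silently downgrading them is a deliberate choice in this sketch: a client that asks for a trust level the server does not recognize should not receive a weaker guarantee without being told.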
- ❌ Yet another router that wraps existing inference engines — Plexus implements its own.
- ❌ Crypto token / NFT / payment mechanism.
- ❌ Single-vendor lock-in.
- ❌ A Python inference path (Rust only).
Plexus is organized as five layers, from GPU kernels up to the API gateway:
┌─────────────────────────────────────────────────────┐
│ Layer 4 — API & Gateway (OpenAI today; more planned)│
├─────────────────────────────────────────────────────┤
│ Layer 3 — Native inference runtime (Llama, CPU) │
├─────────────────────────────────────────────────────┤
│ Layer 2 — Compute graph & scheduler (TP/PP) [design]│
├─────────────────────────────────────────────────────┤
│ Layer 1 — Distributed tensor & collective ops │
├─────────────────────────────────────────────────────┤
│ Layer 0 — Kernel backend (CUDA/Metal/...) [planned] │
└─────────────────────────────────────────────────────┘
Full design (intended system, ~15 sections): docs/architecture/2026-05-22-plexus-design.md.
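To give a sense of what a Layer 3 operation involves, here is a minimal CPU sketch of RMSNorm, one of the ops the Llama forward pass relies on. It follows the standard definition, y_i = x_i / sqrt(mean(x²) + eps) · w_i, over a single hidden vector. The function name and signature are hypothetical and are not taken from plexus-runtime.

```rust
/// RMSNorm over one hidden vector:
/// y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
///
/// Panics if `x` and `weight` differ in length.
/// An empty input returns an empty Vec.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have equal length");
    if x.is_empty() {
        return Vec::new();
    }
    // Mean of squares across the hidden dimension.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Reciprocal of the root mean square, stabilized by eps.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * inv_rms * w)
        .collect()
}
```

With unit weights and `eps = 0`, the output has a mean square of 1, which makes a convenient property to assert in a determinism test.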
plexus/
├── crates/
│ ├── plexus-core/ # error types, shared primitives
│ ├── plexus-tensor/ # tensor + device abstraction
│ ├── plexus-graph/ # shard / partition logic
│ ├── plexus-runtime/ # native model code (Llama CPU forward)
│ ├── plexus-gateway/ # OpenAI-shaped HTTP API
│ ├── plexus-worker/ # gRPC worker (placeholder forward)
│ ├── plexus-kernel/ # GPU backend enum (kernels: planned)
│ ├── plexus-verifier/ # verification primitives (planned)
│ ├── plexus-telemetry/ # metrics scaffolding
│ └── plexus-cli/ # `plexus` binary (serve / worker)
├── proto/ # gRPC protobuf (inference.proto)
├── deploy/ # Dockerfile + Kubernetes manifests
├── docs/ # architecture, ADRs, operations, plans
└── tests/ # integration / determinism / conformance / perf
Requires the Rust toolchain pinned in rust-toolchain.toml (1.95.0);
rustup fetches it automatically.
# build everything
cargo build --workspace
# run the test suite (unit + integration + e2e walking skeleton)
cargo test --workspace
# lint (Plexus uses a strict clippy gate — see CLAUDE.md / STYLE.md)
cargo clippy --workspace --all-targets
# run the gateway locally (binds to 127.0.0.1 by default)
cargo run -p plexus-cli -- serve --port 8080 --model test
# health check
curl localhost:8080/health   # -> {"status":"ok","version":"0.1.0"}

The CPU backend works without any GPU, so the current code can be built and tested on any machine.
The `curl … install.sh` one-liner and the heterogeneous-pool / swarm CLI flags described in the design doc are not available yet; they describe the target UX.
Phase-based, no fixed dates. [x] done, [~] partial, [ ] not started.
| Phase | State | Scope |
|---|---|---|
| 0 — Foundation | [~] | Scaffold, license/policy layer, single-machine walking skeleton |
| 1 — Single GPU + real model | [~] | CPU Llama forward done; GPU backends + real tokenizer pending |
| 2 — Heterogeneous pool ⭐ | [ ] | Mixed-vendor GPUs as one pool (the core wedge) |
| 3 — LAN multi-node cluster | [ ] | libp2p + pipeline / cross-node tensor parallel |
| 4 — Public swarm + verifiability | [ ] | Spot-check / dispute |
| 5 — TEE attestation | [ ] | Confidential-compute backends |
| 6 — API + multimodal | [ ] | Additional API surfaces, more model families |
| 7 — v1.0 launch | [ ] | Security audit, broader worker testing |
Detailed phase plans live under docs/superpowers/plans/ and docs/architecture/2026-05-22-plexus-design.md §10.
Early-stage projects benefit most from focused contributions. See CONTRIBUTING.md for the full process. In short:
- DCO sign-off on every commit (`git commit -s`)
- Conventional Commits for messages
- Open PRs against the `dev` branch
- Include the output of `cargo test --workspace` / `cargo clippy --workspace` in the PR
Please also read the Code of Conduct.
Plexus is independent work, but it draws on ideas from:
- exo-explore/exo — LAN clusters + tensor parallelism
- bigscience-workshop/petals — global swarm + block partitioning
- Folding@home — non-monetary contribution model
- Code: MIT
- Docs: CC-BY-SA 4.0
Copyright (c) 2026 keiailab and Plexus Contributors.