keiailab/plexus

Plexus

Pool the VRAM of several consumer GPUs to run an LLM that none of them could run alone.

License: MIT · Rust · Status: early · Contributor Covenant 2.1

Plexus is a distributed LLM inference mesh written in Rust. The goal is to let a handful of heterogeneous GPUs — different vendors, different VRAM sizes — act as a single, larger pool, so you can run a model that wouldn't fit on any one of them.

⚠️ Plexus is in early development. It does not run real distributed inference yet. The pieces that exist today are a single-machine walking skeleton and a CPU-only model forward pass. See Project Status before evaluating it. The roadmap below describes the intended system, most of which is not built.


What is Plexus?

A single 8 GB GPU can't load a 24 GB model. Plexus aims to make three 8 GB GPUs cooperate as one ~24 GB pool so the model can run. Inference that is limited by memory capacity is split across machines (tensor / pipeline parallelism), with each machine holding part of the weights and KV cache.
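
The pooling arithmetic above can be sketched as a simple capacity check. This is a minimal illustration, not Plexus code: the `Gpu` struct, the `fits` function, and the idea of a separate reserve for KV cache and activations are assumptions made for the example. It also ignores the per-device constraint that each shard must fit on the GPU it is assigned to.

```rust
/// Hypothetical description of one GPU in the pool (not a Plexus type).
struct Gpu {
    name: &'static str,
    vram_gb: f64,
}

/// Total VRAM of the pool in GB.
fn pool_vram_gb(gpus: &[Gpu]) -> f64 {
    gpus.iter().map(|g| g.vram_gb).sum()
}

/// True if the weights plus a reserve for KV cache and activations
/// fit in the pooled VRAM. The reserve is an illustrative parameter.
fn fits(gpus: &[Gpu], weights_gb: f64, reserve_gb: f64) -> bool {
    pool_vram_gb(gpus) >= weights_gb + reserve_gb
}

fn main() {
    let pool = [
        Gpu { name: "gpu-a", vram_gb: 8.0 },
        Gpu { name: "gpu-b", vram_gb: 8.0 },
        Gpu { name: "gpu-c", vram_gb: 8.0 },
    ];
    for g in &pool {
        println!("{}: {} GB", g.name, g.vram_gb);
    }
    println!("pool: {} GB", pool_vram_gb(&pool));
    // Weights alone fit exactly; any KV-cache reserve pushes past the pool.
    println!("24 GB weights, no reserve: {}", fits(&pool, 24.0, 0.0));
    println!("24 GB weights, 2 GB reserve: {}", fits(&pool, 24.0, 2.0));
}
```

The second check is why the pool is described as "~24 GB": a model whose weights fill the pool exactly leaves no room for the KV cache.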

The longer-term vision is a vendor-neutral pool that mixes NVIDIA, Apple Silicon, AMD, Intel, and CPU nodes, with optional verifiability so you can trade latency for trust on a per-request basis. None of that cross-vendor / verifiable functionality exists yet — it is the direction, not the current state.

Who it's for: homelab and small-lab operators who already own a few budget GPUs and want to run larger models than a single card allows.

Project Status

Plexus is pre-alpha. Roughly 5.8k lines of Rust across 10 crates, with ~2.1k lines of tests. Here is an honest breakdown of what is real versus designed:

| Area | State | Notes |
|---|---|---|
| Workspace, crates, CI, license/policy docs | ✅ Done | 10-crate Cargo workspace, lefthook gates, GitLab CI (shadow) |
| HTTP gateway with OpenAI-shaped /v1/chat/completions | ✅ Works | Request/response shape + X-Plexus-Trust header parsing |
| Worker gRPC service + ring forward | ⚠️ Placeholder | Returns a BLAKE3 hash of the input, not a real generation |
| Single-machine "walking skeleton" on Kubernetes | ✅ Works | Verified on a local kind cluster (gateway + 3 worker pods) |
| Layer-by-layer partitioning (even split across N workers) | ✅ Works | Pure arithmetic, unit-tested |
| Llama 3.1 8B CPU forward pass | ⚠️ Partial | Real attention/RoPE/RMSNorm/SwiGLU ops, deterministic, tested on a tiny config; uses a placeholder tokenizer (cl100k_base, not the real Llama vocab) |
| GPU kernels (CUDA/Metal/ROCm/SYCL) | ❌ Not started | plexus-kernel is an enum + TODO; the GPU backend stub panic!s |
| Heterogeneous multi-GPU pooling | ❌ Design only | The core value proposition (not implemented) |
| Multi-node LAN cluster, public swarm, TEE attestation | ❌ Design only | Plans exist under docs/superpowers/plans/ |
| Anthropic / Ollama / MCP API surfaces | ❌ Not started | Only the OpenAI shape exists today |
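
The even layer split listed in the table is described as pure arithmetic. A minimal sketch of what such a split could look like follows. The function name `partition_layers` and the choice to hand leftover layers to the lowest-numbered workers are assumptions for illustration; the actual logic in plexus-graph may differ.

```rust
use std::ops::Range;

/// Split `num_layers` into `num_workers` contiguous ranges whose sizes
/// differ by at most one. Assumption: leftover layers go to the
/// lowest-numbered workers.
fn partition_layers(num_layers: usize, num_workers: usize) -> Vec<Range<usize>> {
    assert!(num_workers > 0, "need at least one worker");
    let base = num_layers / num_workers;
    let extra = num_layers % num_workers;
    let mut start = 0;
    (0..num_workers)
        .map(|w| {
            // The first `extra` workers each take one additional layer.
            let len = base + usize::from(w < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}

fn main() {
    // Llama 3.1 8B has 32 transformer layers; 3 workers mirrors the
    // walking-skeleton setup described in the table.
    for (w, r) in partition_layers(32, 3).iter().enumerate() {
        println!("worker {w}: layers {}..{}", r.start, r.end);
    }
    // prints:
    // worker 0: layers 0..11
    // worker 1: layers 11..22
    // worker 2: layers 22..32
}
```

Contiguous ranges keep each worker's layers adjacent, which suits a ring or pipeline forward where activations pass from one worker to the next.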

In short: today Plexus can stand up a gateway and worker pods that pass tensors around and echo a placeholder, and it can run a small Llama forward pass on CPU. It cannot yet run a real model split across real GPUs. If you're looking for working distributed inference right now, see exo or petals.
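
For readers unfamiliar with the ops named in the CPU forward pass, here is a rough sketch of two of them, RMSNorm and the SiLU activation used inside SwiGLU, written against plain slices. This is the textbook form of each op, not the plexus-runtime implementation; the function names and signatures are assumptions.

```rust
/// RMSNorm over one vector:
/// y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "input must be non-empty");
    assert_eq!(x.len(), weight.len(), "x and weight must have the same length");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gating applied elementwise: silu(gate[i]) * up[i].
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have the same length");
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

fn main() {
    let x = [3.0_f32, 4.0];
    let w = [1.0_f32, 1.0];
    let y = rms_norm(&x, &w, 1e-5);
    println!("rms_norm: {y:?}");
    println!("swiglu:   {:?}", swiglu(&[0.0, 1.0], &[2.0, 2.0]));
}
```

Both functions use a fixed, sequential summation order, so repeated runs on the same machine give the same result.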

Why Plexus (design goals)

| Goal | Intent |
|---|---|
| Heterogeneous pooling | Treat mixed-vendor GPUs (NVIDIA / Apple / AMD / Intel / CPU) as one logical pool |
| Native Rust runtime | Implement the inference stack directly (no Candle / vLLM / llama.cpp dependency) |
| Optional verifiability | Per-request trust levels (fast / verified / attested); design stage |
| Permissive open source | MIT licensed; no token, NFT, or paywalled "enterprise edition" |
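
The per-request trust levels in the table, together with the X-Plexus-Trust header the gateway already parses, suggest a small enum plus a parser. The sketch below is one possible shape, not the gateway's actual code: the `TrustLevel` type, the accepted header values, the case-insensitive matching, and the fallback to `Fast` when the header is absent are all assumptions.

```rust
use std::str::FromStr;

/// Hypothetical trust levels, named after the fast / verified / attested
/// levels in the design goals.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TrustLevel {
    Fast,
    Verified,
    Attested,
}

impl FromStr for TrustLevel {
    type Err = String;

    /// Case-insensitive parse of a header value (assumed format).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "fast" => Ok(Self::Fast),
            "verified" => Ok(Self::Verified),
            "attested" => Ok(Self::Attested),
            other => Err(format!("unknown trust level: {other:?}")),
        }
    }
}

/// Resolve the trust level from an optional X-Plexus-Trust header value.
/// Assumption: a missing header means the lowest-latency level.
fn trust_from_header(value: Option<&str>) -> Result<TrustLevel, String> {
    match value {
        Some(v) => v.parse(),
        None => Ok(TrustLevel::Fast),
    }
}

fn main() {
    println!("{:?}", trust_from_header(Some("verified")));
    println!("{:?}", trust_from_header(None));
    println!("{:?}", trust_from_header(Some("bogus")));
}
```

Rejecting unknown values, rather than silently downgrading them, keeps a typo in the header from weakening the trust level a client asked for.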

Explicitly not goals

  • ❌ Yet another router that wraps existing inference engines — Plexus implements its own.
  • ❌ Crypto token / NFT / payment mechanism.
  • ❌ Single-vendor lock-in.
  • ❌ A Python inference path (Rust only).

Architecture

Plexus is organized as five layers, from GPU kernels up to the API gateway:

┌─────────────────────────────────────────────────────┐
│  Layer 4 — API & Gateway (OpenAI today; more planned)│
├─────────────────────────────────────────────────────┤
│  Layer 3 — Native inference runtime (Llama, CPU)     │
├─────────────────────────────────────────────────────┤
│  Layer 2 — Compute graph & scheduler (TP/PP) [design]│
├─────────────────────────────────────────────────────┤
│  Layer 1 — Distributed tensor & collective ops       │
├─────────────────────────────────────────────────────┤
│  Layer 0 — Kernel backend (CUDA/Metal/...) [planned] │
└─────────────────────────────────────────────────────┘

Full design (intended system, ~15 sections): docs/architecture/2026-05-22-plexus-design.md.

Workspace layout

plexus/
├── crates/
│   ├── plexus-core/        # error types, shared primitives
│   ├── plexus-tensor/      # tensor + device abstraction
│   ├── plexus-graph/       # shard / partition logic
│   ├── plexus-runtime/     # native model code (Llama CPU forward)
│   ├── plexus-gateway/     # OpenAI-shaped HTTP API
│   ├── plexus-worker/      # gRPC worker (placeholder forward)
│   ├── plexus-kernel/      # GPU backend enum (kernels: planned)
│   ├── plexus-verifier/    # verification primitives (planned)
│   ├── plexus-telemetry/   # metrics scaffolding
│   └── plexus-cli/         # `plexus` binary (serve / worker)
├── proto/                  # gRPC protobuf (inference.proto)
├── deploy/                 # Dockerfile + Kubernetes manifests
├── docs/                   # architecture, ADRs, operations, plans
└── tests/                  # integration / determinism / conformance / perf

Building

Requires the Rust toolchain pinned in rust-toolchain.toml (1.95.0); rustup fetches it automatically.

# build everything
cargo build --workspace

# run the test suite (unit + integration + e2e walking skeleton)
cargo test --workspace

# lint (Plexus uses a strict clippy gate — see CLAUDE.md / STYLE.md)
cargo clippy --workspace --all-targets

# run the gateway locally (binds to 127.0.0.1 by default)
cargo run -p plexus-cli -- serve --port 8080 --model test

# health check
curl localhost:8080/health        # -> {"status":"ok","version":"0.1.0"}

The CPU backend works without any GPU, so the current code can be built and tested on any machine.

The curl … install.sh one-liner and the heterogeneous-pool / swarm CLI flags described in the design doc are not available yet; they describe the target UX.

Roadmap

Phase-based, no fixed dates. [x] done, [~] partial, [ ] not started.

| Phase | State | Scope |
|---|---|---|
| 0: Foundation | [~] | Scaffold, license/policy layer, single-machine walking skeleton |
| 1: Single GPU + real model | [~] | CPU Llama forward done; GPU backends + real tokenizer pending |
| 2: Heterogeneous pool ⭐ | [ ] | Mixed-vendor GPUs as one pool (the core wedge) |
| 3: LAN multi-node cluster | [ ] | libp2p + pipeline / cross-node tensor parallel |
| 4: Public swarm + verifiability | [ ] | Spot-check / dispute |
| 5: TEE attestation | [ ] | Confidential-compute backends |
| 6: API + multimodal | [ ] | Additional API surfaces, more model families |
| 7: v1.0 launch | [ ] | Security audit, broader worker testing |

Detailed phase plans live under docs/superpowers/plans/ and docs/architecture/2026-05-22-plexus-design.md §10.

Contributing

Early-stage projects benefit most from focused contributions. See CONTRIBUTING.md for the full process. In short:

  • DCO sign-off on every commit (git commit -s)
  • Conventional Commits for messages
  • Open PRs against the dev branch
  • Include the output of cargo test --workspace and cargo clippy --workspace --all-targets in the PR

Please also read the Code of Conduct.

Prior art & inspiration

Plexus is independent work, but it draws on ideas from:

License

  • Code: MIT
  • Docs: CC-BY-SA 4.0

Copyright (c) 2026 keiailab and Plexus Contributors.
