This guide walks you through building Spur from source, running a single-node cluster on your laptop, then expanding to a multi-node GPU cluster with WireGuard mesh networking.
Required:
- Linux (kernel 5.6+ for in-kernel WireGuard; any recent Ubuntu/Fedora/RHEL works)
- Rust toolchain (1.75+)
- Protocol Buffers compiler
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
# Install protobuf compiler
sudo apt install protobuf-compiler # Debian/Ubuntu
sudo dnf install protobuf-compiler # Fedora/RHEL
Optional (for multi-node networking):
sudo apt install wireguard-tools # WireGuard CLI (wg, wg-quick)

git clone https://github.com/ROCm/spur.git
cd spur
cargo build --release
Binaries are in target/release/:
| Binary | Role |
|---|---|
| spur | CLI: submit jobs, check status, manage cluster |
| spurctld | Controller daemon: scheduling, state management |
| spurd | Node agent: runs on every compute node |
| spurdbd | Accounting daemon (optional, needs PostgreSQL) |
| spurrestd | REST API (optional) |
The fastest way to try Spur is a single-node cluster: everything runs on one machine, and no config file is needed.
# Create state directory
mkdir -p /tmp/spur-state
# Start controller in foreground (uses built-in defaults)
./target/release/spurctld -D --state-dir /tmp/spur-state --log-level info
The controller listens on port 6817 and creates a single "default" partition.
# In a second terminal (the controller stays in the foreground), start the node agent
./target/release/spurd -D --controller http://localhost:6817
The agent auto-discovers local CPUs, memory, and GPUs, then registers with the controller. You should see:
INFO spurd: resources discovered cpus=16 memory_mb=32000 gpus=0
INFO spurd::reporter: registered with controller
# Create a simple job script
cat > /tmp/hello.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:01:00
echo "Hello from Spur!"
echo "Running on $(hostname) with $SPUR_CPUS_ON_NODE CPUs"
sleep 2
echo "Done."
EOF
# Submit it
./target/release/spur submit /tmp/hello.sh
Output: Submitted batch job 1
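Scripts that submit jobs usually need the job ID from that output line. The sketch below assumes the output always has the form "Submitted batch job N", as in the example above; `parse_job_id` is a hypothetical helper, not a Spur command, and it rejects anything that does not match instead of guessing.

```shell
# Hypothetical helper (not part of Spur): extract the numeric job ID
# from a line of the form "Submitted batch job <N>".
parse_job_id() {
    _id=${1#"Submitted batch job "}     # strip the fixed prefix
    if [ "$_id" = "$1" ]; then          # prefix was not present
        echo "unexpected submit output: $1" >&2
        return 1
    fi
    case "$_id" in
        ''|*[!0-9]*)                    # empty or contains a non-digit
            echo "unexpected submit output: $1" >&2
            return 1 ;;
    esac
    echo "$_id"
}

# Demo with the output line shown above; in practice, pass the captured
# stdout of the submit command instead.
parse_job_id "Submitted batch job 1"
```

Failing loudly on unexpected output keeps an error message that happens to contain digits from being mistaken for a job ID.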
# View the job queue
./target/release/spur queue
# View nodes
./target/release/spur nodes
# View detailed job info
./target/release/spur show job 1
# Check job output (default location)
cat /tmp/spur-1.out
# srun-style: run a command directly
./target/release/spur run hostname
./target/release/spur run -- bash -c "echo I have \$SPUR_CPUS_ON_NODE CPUs"
If you're migrating from Slurm, create symlinks and use the familiar commands:
cd target/release
for cmd in sbatch srun squeue scancel sinfo sacct scontrol; do
ln -sf spur $cmd
done
# Now use Slurm commands directly
./sbatch /tmp/hello.sh
./squeue
./sinfo -N
./scancel 2
For a real cluster spanning multiple machines, WireGuard creates an encrypted mesh so that all nodes can reach each other regardless of firewalls or NAT.
controller (192.168.1.100) → WireGuard 10.44.0.1
gpu-node-1 (192.168.1.101) → WireGuard 10.44.0.2
gpu-node-2 (192.168.1.102) → WireGuard 10.44.0.3
On the controller:
# Initialize the mesh — generates keys, assigns 10.44.0.1, brings up spur0
sudo spur net init --cidr 10.44.0.0/16 --port 51820
This prints the server public key and a join command template. Save the public key.
On each compute node:
# Join the mesh — generates local keys, connects to controller
sudo spur net join \
--endpoint 192.168.1.100:51820 \
--server-key <controller-pubkey> \
--address 10.44.0.2
# Then add this node as a peer on the controller
# (run on controller, using the pubkey printed by the join command)
sudo spur net add-peer \
--key <node-pubkey> \
--allowed-ip 10.44.0.2/32 \
--endpoint 192.168.1.101:51820
Repeat for each node, incrementing the address (10.44.0.3, 10.44.0.4, ...).
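With more than a couple of nodes, the controller-side commands are easier to generate than to type. The following is a minimal sketch, assuming nodes join in the order given and take consecutive addresses starting at 10.44.0.2. `peer_commands` is a hypothetical helper. It only prints commands, using the flags shown above, and leaves the public key as a placeholder because each key is produced by that node's join step.

```shell
# Hypothetical helper: print one `spur net add-peer` command per node.
# Arguments: LAN IPs of the compute nodes, in join order.
# Mesh addresses start at 10.44.0.2 because 10.44.0.1 is the controller.
peer_commands() {
    _host=2
    for _lan_ip in "$@"; do
        echo "sudo spur net add-peer --key <node-pubkey> --allowed-ip 10.44.0.${_host}/32 --endpoint ${_lan_ip}:51820"
        _host=$((_host + 1))
    done
}

# The two compute nodes from the example topology.
peer_commands 192.168.1.101 192.168.1.102
```

The sketch only increments the last octet, so it covers mesh addresses up to 10.44.0.254; a larger cluster would need to carry into the third octet.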
Verify connectivity:
# On any node
spur net status # Shows WireGuard peers and handshake times
ping 10.44.0.1 # Ping the controller through the mesh
Create /etc/spur/spur.conf (same file on all nodes, or distribute via your config management):
cluster_name = "gpu-cluster"
[controller]
listen_addr = "[::]:6817"
hosts = ["10.44.0.1"]
state_dir = "/var/spool/spur"
[scheduler]
plugin = "backfill"
interval_secs = 1
[network]
wg_enabled = true
wg_interface = "spur0"
agent_port = 6818
[[partitions]]
name = "gpu"
default = true
nodes = "gpu-node-[1-2]"
max_time = "72:00:00"
[[nodes]]
names = "gpu-node-[1-2]"
cpus = 128
memory_mb = 512000
gres = ["gpu:mi300x:8"]
On the controller (10.44.0.1):
sudo mkdir -p /var/spool/spur
spurctld -D -f /etc/spur/spur.conf
On each compute node:
spurd -D \
--controller http://10.44.0.1:6817 \
--hostname gpu-node-1 \
--listen [::]:6818
The agent detects the spur0 WireGuard interface and registers with its 10.44.x.y address. The controller will dispatch jobs to that address.
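The config file names nodes with the range notation gpu-node-[1-2], while each agent is started with a single --hostname value. The sketch below shows how such a range reads when expanded, assuming it follows the Slurm-style prefix[lo-hi] convention (this guide does not define the notation). `expand_hostlist` is a hypothetical helper that handles only one numeric range.

```shell
# Hypothetical helper: expand "prefix[lo-hi]" into one hostname per line.
# Only a single numeric range is handled; anything else is echoed unchanged.
expand_hostlist() {
    case "$1" in
        *"["*-*"]")
            _prefix=${1%%"["*}          # text before "["
            _range=${1#*"["}            # text after "["
            _range=${_range%"]"}        # drop the closing "]"
            _lo=${_range%%-*}
            _hi=${_range#*-}
            _i=$_lo
            while [ "$_i" -le "$_hi" ]; do
                echo "${_prefix}${_i}"
                _i=$((_i + 1))
            done ;;
        *)
            echo "$1" ;;
    esac
}

# The node range used in the example spur.conf.
expand_hostlist "gpu-node-[1-2]"
```

Each printed name is what the corresponding node would pass to --hostname, so the names the agents register under match the ones listed in the config.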
# 2-node job
spur submit --nodes=2 --ntasks-per-node=8 train.sh
# Or with SBATCH directives
cat > train.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:mi300x:8
#SBATCH --time=4:00:00
echo "Node: $(hostname)"
echo "Peers: $SPUR_PEER_NODES"
echo "Task offset: $SPUR_TASK_OFFSET"
echo "Total nodes: $SPUR_NUM_NODES"
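# torchrun expects node_rank in 0..nnodes-1. SPUR_TASK_OFFSET counts tasks
# (0 or 8 here), so divide by the 8 tasks per node.
torchrun \
--nnodes=$SPUR_NUM_NODES \
--node_rank=$((SPUR_TASK_OFFSET / 8)) \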
--master_addr=$(echo $SPUR_PEER_NODES | cut -d: -f1) \
--master_port=29500 \
--nproc_per_node=8 \
train.py
EOF
spur submit train.sh
When a multi-node job is dispatched, each node receives:
| Environment Variable | Example | Description |
|---|---|---|
| SPUR_JOB_ID | 42 | Job ID |
| SPUR_NUM_NODES | 2 | Total nodes in allocation |
| SPUR_TASK_OFFSET | 0 or 8 | This node's starting task index |
| SPUR_PEER_NODES | 10.44.0.2:6818,10.44.0.3:6818 | All nodes in the allocation |
| SPUR_CPUS_ON_NODE | 128 | CPUs allocated on this node |
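Launchers usually need two values derived from these variables: a rendezvous address and a zero-based node rank. The sketch below uses the example values from the table and assumes that task offsets advance by a fixed tasks-per-node count (0, 8, 16, ...). `first_peer_host` and `node_rank` are hypothetical helper names, not something Spur provides.

```shell
# Hypothetical helpers for reading the variables in the table above.

# Print the host part of the first entry in a "host:port,host:port" list.
first_peer_host() {
    _first=${1%%,*}         # keep everything before the first comma
    echo "${_first%%:*}"    # drop the ":port" suffix
}

# Print a zero-based node rank from a task offset and a tasks-per-node count,
# assuming offsets advance by a fixed tasks-per-node (0, 8, 16, ...).
node_rank() {
    echo $(( $1 / $2 ))
}

# Example values from the table; inside a job, pass "$SPUR_PEER_NODES"
# and "$SPUR_TASK_OFFSET" instead.
first_peer_host "10.44.0.2:6818,10.44.0.3:6818"
node_rank 8 8
```

With the table's values this prints 10.44.0.2 and 1. The colon split assumes IPv4 peer addresses, which is what the mesh in this guide uses.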
cat > gpu-test.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=rocm-test
#SBATCH -N 1
#SBATCH --gres=gpu:mi300x:4
#SBATCH --time=00:10:00
rocm-smi
hipcc -o /tmp/vectoradd vectoradd.cpp && /tmp/vectoradd
EOF
spur submit gpu-test.sh
Spur sets ROCR_VISIBLE_DEVICES to restrict GPU visibility.
cat > cuda-test.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=cuda-test
#SBATCH --gres=gpu:h100:2
nvidia-smi
python -c "import torch; print(torch.cuda.device_count())"
EOF
spur submit cuda-test.sh
Spur sets CUDA_VISIBLE_DEVICES for NVIDIA isolation.
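A job script can check how many GPUs it was given before launching work. The sketch below assumes the variable holds the usual comma-separated device list (indices or UUIDs), which is the standard format for both ROCR_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES; the exact values Spur writes are not shown in this guide. `count_visible_devices` is a hypothetical helper.

```shell
# Hypothetical helper: count entries in a comma-separated device list
# such as the value of ROCR_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES.
count_visible_devices() {
    # awk splits the single input line on commas; NF is the field count
    # (0 for an empty list).
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Example value for a 4-GPU allocation; inside a job, pass
# "$ROCR_VISIBLE_DEVICES" or "$CUDA_VISIBLE_DEVICES" instead.
count_visible_devices "0,1,2,3"
```

Comparing this count against the number requested with --gres is a cheap sanity check at the top of a job script.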
# Queue with formatting
spur queue # All jobs
spur queue -u $USER # Your jobs
spur queue --states=PENDING # Just pending
# Cancel
spur cancel 42 # Cancel by ID
spur cancel --user=alice # Cancel all of alice's jobs
# Hold and release
spur show control hold job 42 # Prevent scheduling
spur show control release job 42 # Allow scheduling
# Node management
spur nodes # All nodes
spur show node gpu-node-1 # Detailed info
# Drain a node for maintenance
spur show control update node gpu-node-1 state=drain reason="maintenance"
# Return to service
spur show control update node gpu-node-1 state=idle
Submit many similar jobs at once:
cat > array-job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=0-99%10
echo "Task $SLURM_ARRAY_TASK_ID of $SLURM_ARRAY_JOB_ID"
python train.py --lr=$(echo "0.001 * $SLURM_ARRAY_TASK_ID" | bc)
EOF
spur submit array-job.sh
The %10 suffix limits concurrency to 10 tasks at a time.
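The multiplication in the example gives task 0 a learning rate of 0, which is rarely what a sweep wants. One alternative is to map the task index onto an explicit grid. The sketch below is illustrative only: `lr_for_task` is a hypothetical helper, and the five grid values are made up for the example.

```shell
# Hypothetical helper: pick a learning rate for an array task index by
# cycling through a fixed five-value grid, so task 0 gets a non-zero value.
lr_for_task() {
    case $(( $1 % 5 )) in
        0) echo 0.0001 ;;
        1) echo 0.0003 ;;
        2) echo 0.001 ;;
        3) echo 0.003 ;;
        4) echo 0.01 ;;
    esac
}

# Inside the array job, pass "$SLURM_ARRAY_TASK_ID" instead of a literal.
lr_for_task 0
lr_for_task 7
```

In the job script this would be used as `python train.py --lr=$(lr_for_task "$SLURM_ARRAY_TASK_ID")`. With 100 tasks and five values, each value is reused 20 times, so a real sweep would vary a second parameter or use a longer grid.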
Chain jobs together:
# Submit preprocessing
JOB1=$(spur submit preprocess.sh | grep -o '[0-9]*')
# Training starts after preprocessing succeeds
JOB2=$(spur submit --dependency=afterok:$JOB1 train.sh | grep -o '[0-9]*')
# Evaluation after training (even if it fails)
spur submit --dependency=afterany:$JOB2 eval.sh

User: spur submit / sbatch / REST API
              │
              ▼
     ┌─────────────────┐
     │    spurctld     │  Port 6817 (gRPC)
     │  - backfill     │  Port 6820 (REST, via spurrestd)
     │    scheduler    │
     │  - Raft log +   │
     │    snapshots    │
     └────────┬────────┘
              │  dispatches jobs via gRPC
       ┌──────┼──────┐
       ▼      ▼      ▼
     spurd  spurd  spurd     Port 6818 each
     node1  node2  node3
       │      │      │
       └── WireGuard mesh (10.44.0.0/16) ──┘

- spurctld is the brain — one instance, manages all state, runs the scheduler
- spurd runs on every compute node — receives jobs, manages processes via cgroups v2
- Communication is all gRPC over WireGuard (encrypted, NAT-proof)
- State survives restarts via Raft log + periodic snapshots (always-on, even single-node)
Controller won't start:
# Check if port is in use
ss -tlnp | grep 6817
# Check state directory permissions
ls -la /var/spool/spur/
Agent can't register:
# Test connectivity to controller
grpcurl -plaintext localhost:6817 list
# Or just curl the port
curl http://localhost:6817 # Will fail but proves port is reachable
# Check agent logs
spurd -D --controller http://10.44.0.1:6817 --log-level debug
Jobs stay PENDING:
# Check node state — must be "idle" or "mixed"
spur nodes
# Check if nodes are in the right partition
spur show node <name>
# Check job requirements vs available resources
spur show job <id>
WireGuard not working:
spur net status # Shows interface and peer info
sudo wg show spur0 # Raw WireGuard status
ping 10.44.0.1 # Test mesh connectivity
ip addr show spur0 # Check interface has IP
Multi-node job only runs on one node:
# Check that all nodes registered with their WireGuard IP
spur nodes
# Node addresses should be 10.44.x.y, not 127.0.0.1
spur show node gpu-node-1

- Accounting: Start spurdbd with PostgreSQL for job history and fair-share scheduling
- REST API: Start spurrestd for HTTP access (Slurm-compatible /slurm/v0.0.42/ endpoints)
- Prolog/Epilog: Set SPUR_PROLOG / SPUR_EPILOG environment variables to run scripts before/after jobs
- Kubernetes: K8s integration is planned for hybrid cloud+on-prem scheduling