Benchmark automation and lab workbooks for evaluating WEKA storage via virtiofs on Ubuntu 24.04 KVM hypervisors.
| File | Purpose |
|---|---|
| `bench_suite.sh` | End-to-end fio benchmark script — runs on the benchmark client |
| `checkpoint_bench.py` | PyTorch model checkpoint save/load benchmark (1/5/10 GB, CPU tensors) |
| `make_graphs.py` | Reads results, generates charts + `WORKBOOK.md` into `runs/` |
| `terraform/` | Terraform module + GCP environment for one-command lab provisioning |
| `scripts/` | Helper scripts (e.g. `create-weka-token-secret.sh`) |
| `agents/` | Agent coordination files — PLAN, STATUS, and per-role instructions for running benchmark sessions |
| `runs/` | Timestamped results archive — one directory per run |
Each entry in `runs/` is a self-contained lab result:

```
runs/
└── YYYY-MM-DD-<hostname>/
    ├── results.json             — raw fio metrics (all 6 configurations)
    ├── virtiofs_hypervisor.png  — UDP cached vs O_DIRECT hypervisor comparison
    ├── virtiofs_tldr.png        — hypervisor reference + C/Rust virtiofsd VMs
    ├── virtiofs_checkpoint.png  — PyTorch torch.save/load throughput by config + size
    ├── WORKBOOK.md              — full writeup with tables, findings, charts
    ├── bench.log                — full fio + virtiofsd output from the client
    ├── make.log                 — local orchestration log (terraform + scp output)
    └── results/
        ├── checkpoint_hypervisor.json     — raw checkpoint timings, hypervisor direct
        ├── checkpoint_c_virtiofsd.json    — raw checkpoint timings, C virtiofsd guest
        └── checkpoint_rust_virtiofsd.json — raw checkpoint timings, Rust virtiofsd guest
```
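To pull numbers straight from a run's archive, a quick jq sketch; this assumes `results.json` carries fio's standard `--output-format=json` structure, which the "raw fio metrics" description suggests but this doc doesn't confirm:

```bash
# Per-job bandwidth (KiB/s) and IOPS from a run's raw fio output
# (assumes fio's standard JSON schema: .jobs[].jobname, .write.bw, .read.iops)
jq '.jobs[] | {job: .jobname, write_bw_kib: .write.bw, read_iops: .read.iops}' \
    "runs/YYYY-MM-DD-<hostname>/results.json"
```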
The live orchestration log during a run is /tmp/virtiofs-bench-run.log — tail it to monitor progress.
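For example:

```bash
tail -f /tmp/virtiofs-bench-run.log
```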
Terraform provisions a full WEKA test environment (backends + client VM) with one command, enabling reproducible lab runs without manual setup.
Expected cost: ~$34/run (standard, n2-standard-32 client); ~$57/run (full with cold phases, n2-highmem-64).
Prerequisites: Terraform ≥ 1.3, `gcloud auth login && gcloud auth application-default login`, and access to the team-cst GCP project.
```bash
git clone git@github.com:weka/virtiofs-bench.git
cd virtiofs-bench
cp terraform/environments/gcp-lab/terraform.tfvars.example \
   terraform/environments/gcp-lab/terraform.tfvars
# Edit terraform.tfvars: set cluster_name (region/zone default to europe-west1-b)

# One-time token setup (recommended): store the token in Secret Manager;
# make picks it up automatically.
scripts/create-weka-token-secret.sh
# Fallback: set get_weka_io_token = "..." in terraform.tfvars instead.
```
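For reference, a minimal terraform.tfvars sketch. Only `cluster_name` and the `get_weka_io_token` fallback are variable names taken from this doc; the cluster name shown is a made-up example:

```bash
# Write a minimal tfvars; region/zone default to europe-west1-b if unset.
cat > terraform/environments/gcp-lab/terraform.tfvars <<'EOF'
cluster_name = "virtiofs-bench-lab"   # example value, pick your own
# get_weka_io_token = "..."           # fallback if Secret Manager isn't used
EOF
```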
```bash
make cluster-up    # ~15–20 min — 6 backends + client VM, /mnt/weka mounted
make prereqs       # fio, QEMU, Rust virtiofsd, Rocky 8, torch + checkpoint_bench.py (~15 min)
make bench         # fio suite + PyTorch checkpoint benchmark (~90 min)
make results       # copy results.json + checkpoint JSONs, generate graphs, commit + push
make cluster-down  # destroy all GCP resources when done
```

Or run the full pipeline in one shot:

```bash
make run           # cluster-up → prereqs → bench → results → cluster-down (auto-destroys)
```

To keep the cluster up after the run for inspection:

```bash
make all           # cluster-up → prereqs → bench → results (cluster stays up)
make cluster-down  # destroy when done
```

For full cold-phase runs (phases 6c/7c, requires 512 GB RAM):

```bash
make cluster-up BENCH_CLIENT=n2-highmem-64
make bench-full
```

Multi-cloud support (AWS, OCI) lives in the `multi-cloud` branch.
```
Hypervisor (Ubuntu 24.04)
└── WEKA client (DPDK, 8 cores) → WEKA backends
    └── virtiofsd (C or Rust) → vhost-user-fs socket
        └── Guest VM (Rocky 8)
            └── /mnt/weka (virtiofs mount, no WEKA agent in guest)
```
Hypervisor benchmarks run directly against /mnt/weka. VM benchmarks boot
a Rocky 8 guest via QEMU, mount via virtiofs, and run fio inside the guest.
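To make the wiring concrete, here is a minimal launch sketch. The real invocations live in the benchmark scripts; the socket path, guest memory size, disk image name, and the `weka` mount tag are illustrative assumptions, and binary paths vary by install:

```bash
# C virtiofsd exports the hypervisor's WEKA mount over a vhost-user socket
/usr/lib/qemu/virtiofsd --socket-path=/tmp/vhost-fs.sock \
    -o source=/mnt/weka -o cache=none &

# Guest RAM must be a shared file-backed memory object for vhost-user-fs;
# queue-size=1024 per the key rules below
qemu-system-x86_64 -enable-kvm -smp 8 -m 8G \
    -object memory-backend-file,id=mem,size=8G,mem-path=/dev/shm,share=on \
    -numa node,memdev=mem \
    -chardev socket,id=vfs0,path=/tmp/vhost-fs.sock \
    -device vhost-user-fs-pci,chardev=vfs0,tag=weka,queue-size=1024 \
    -drive file=rocky8.qcow2,if=virtio -nographic

# Inside the guest: mount by tag, no WEKA agent required
mount -t virtiofs weka /mnt/weka
```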
See OVERVIEW.md for the full recommendation, benchmark numbers, fleet orchestration patterns, and VM RAM sizing guidance.
Short version (C virtiofsd, cache=none + O_DIRECT, 6-backend GCP cluster, kernel 6.8.0):
| Configuration | Seq Write | Rand 4K Read | Latency |
|---|---|---|---|
| Hypervisor O_DIRECT (reference) | 646 MiB/s | 8.4 KiOPS | 965 µs |
| C virtiofsd cache=auto | 2,287 MiB/s | 39.3 KiOPS | 78 µs |
| Rust virtiofsd cache=auto | 2,013 MiB/s | 39.7 KiOPS | 84 µs |
| C virtiofsd cache=none + O_DIRECT | 3,650 MiB/s | 56.9 KiOPS | 78 µs |
| Rust virtiofsd cache=none + O_DIRECT | 3,350 MiB/s | 53.5 KiOPS | 85 µs |
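As a rough reproduction of those rows, jobs of this shape can be run inside the guest. The canonical job definitions live in `bench_suite.sh`; the depths, sizes, and runtimes below are illustrative guesses:

```bash
# Sequential write, O_DIRECT, against the virtiofs mount
fio --name=seqwrite --filename=/mnt/weka/fio.dat --rw=write --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=16 --size=10G \
    --runtime=60 --time_based --group_reporting

# Random 4K read, O_DIRECT
fio --name=randread --filename=/mnt/weka/fio.dat --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --size=10G \
    --runtime=60 --time_based --group_reporting
```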
PyTorch checkpoint overhead (virtiofsd vs hypervisor direct): ~35–40% slower saves, ~38–47% slower loads. A 10 GB LLaMA-7B shard: ~22 s via virtiofsd vs ~14 s direct. At typical checkpoint cadence (every 3–40 min) this is <4% of training time.
Key rules:

- Use `--cache=none` (C) / `--cache=never` (Rust) for AI training — bypasses the host page cache; reads go directly through the WEKA stack to the distributed NVMe backends (launch sketch below)
- Never use `--writeback` — confirmed 3.7× sequential write collapse
- Set `queue-size=1024` on the vhost-user-fs device — the default of 128 bottlenecks throughput
- C virtiofsd leads on kernel 6.8.0; both C and Rust are production-viable
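For the Rust daemon, the equivalent launch looks like this sketch, reusing the illustrative socket path from the QEMU example above (the binary path depends on how `make prereqs` installed it):

```bash
# Rust virtiofsd: --cache never corresponds to the C daemon's -o cache=none
/usr/libexec/virtiofsd --socket-path /tmp/vhost-fs.sock \
    --shared-dir /mnt/weka --cache never
```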