Benchmark automation and lab workbooks for evaluating WEKA storage via virtiofs on Ubuntu 24.04 KVM hypervisors.
| File | Purpose |
|---|---|
| `bench_suite.sh` | End-to-end fio benchmark script — runs on the benchmark client |
| `checkpoint_bench.py` | PyTorch model checkpoint save/load benchmark (1/5/10 GB, CPU tensors) |
| `make_graphs.py` | Reads results, generates charts + `WORKBOOK.md` into `runs/` |
| `terraform/` | Terraform module + GCP environment for one-command lab provisioning |
| `scripts/` | Helper scripts (e.g. `create-weka-token-secret.sh`) |
| `agents/` | Agent coordination files — PLAN, STATUS, and per-role instructions for running benchmark sessions |
| `runs/` | Timestamped results archive — one directory per run |
Each entry in `runs/` is a self-contained lab result:

```
runs/
└── YYYY-MM-DD-<hostname>/
    ├── results.json             — raw fio metrics (all 6 configurations)
    ├── virtiofs_hypervisor.png  — UDP cached vs O_DIRECT hypervisor comparison
    ├── virtiofs_tldr.png        — hypervisor reference + C/Rust virtiofsd VMs
    ├── virtiofs_checkpoint.png  — PyTorch torch.save/load throughput by config + size
    ├── WORKBOOK.md              — full writeup with tables, findings, charts
    ├── bench.log                — full fio + virtiofsd output from the client
    ├── make.log                 — local orchestration log (terraform + scp output)
    └── results/
        ├── checkpoint_hypervisor.json     — raw checkpoint timings, hypervisor direct
        ├── checkpoint_c_virtiofsd.json    — raw checkpoint timings, C virtiofsd guest
        └── checkpoint_rust_virtiofsd.json — raw checkpoint timings, Rust virtiofsd guest
```
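To pull numbers straight from a run's archive, a quick jq sketch; this assumes `results.json` carries fio's standard `--output-format=json` structure, which the "raw fio metrics" description suggests but this doc doesn't confirm:

```bash
# Per-job bandwidth (KiB/s) and IOPS from a run's raw fio output
# (assumes fio's standard JSON schema: .jobs[].jobname, .write.bw, .read.iops)
jq '.jobs[] | {job: .jobname, write_bw_kib: .write.bw, read_iops: .read.iops}' \
    "runs/YYYY-MM-DD-<hostname>/results.json"
```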
The live orchestration log during a run is /tmp/virtiofs-bench-run.log — tail it to monitor progress.
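For example:

```bash
tail -f /tmp/virtiofs-bench-run.log
```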
Terraform provisions a full WEKA test environment (backends + client VM) with one command, enabling reproducible lab runs without manual setup.
Expected cost: ~$34/run (standard, n2-standard-32 client); ~$57/run (full with cold phases, n2-highmem-64).
Prerequisites: Terraform ≥ 1.3, `gcloud auth login && gcloud auth application-default login`, and access to the team-cst GCP project.
```bash
git clone git@github.com:weka/virtiofs-bench.git
cd virtiofs-bench
cp terraform/environments/gcp-lab/terraform.tfvars.example \
   terraform/environments/gcp-lab/terraform.tfvars
# Edit terraform.tfvars: set cluster_name (region/zone default to europe-west1-b)

# One-time token setup (recommended): store the token in Secret Manager;
# make picks it up automatically.
scripts/create-weka-token-secret.sh
# Fallback: set get_weka_io_token = "..." in terraform.tfvars instead.
```
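For reference, a minimal terraform.tfvars sketch. Only `cluster_name` and the `get_weka_io_token` fallback are variable names taken from this doc; the cluster name shown is a made-up example:

```bash
# Write a minimal tfvars; region/zone default to europe-west1-b if unset.
cat > terraform/environments/gcp-lab/terraform.tfvars <<'EOF'
cluster_name = "virtiofs-bench-lab"   # example value, pick your own
# get_weka_io_token = "..."           # fallback if Secret Manager isn't used
EOF
```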
```bash
make cluster-up    # ~15–20 min — 6 backends + client VM, /mnt/weka mounted
make prereqs       # fio, QEMU, Rust virtiofsd, Rocky 8, torch + checkpoint_bench.py (~15 min)
make bench         # fio suite + PyTorch checkpoint benchmark (~90 min)
make results       # copy results.json + checkpoint JSONs, generate graphs, commit + push
make cluster-down  # destroy all GCP resources when done
```

Or run the full pipeline in one shot:

```bash
make run           # cluster-up → prereqs → bench → results → cluster-down (auto-destroys)
```

To keep the cluster up after the run for inspection:

```bash
make all           # cluster-up → prereqs → bench → results (cluster stays up)
make cluster-down  # destroy when done
```

For full cold-phase runs (phases 6c/7c, requires 512 GB RAM):

```bash
make cluster-up BENCH_CLIENT=n2-highmem-64
make bench-full
```

Multi-cloud support (AWS, OCI) lives in the `multi-cloud` branch.
```
Hypervisor (Ubuntu 24.04)
└── WEKA client (DPDK, 8 cores) → WEKA backends
    └── virtiofsd (C or Rust) → vhost-user-fs socket
        └── Guest VM (Rocky 8)
            └── /mnt/weka (virtiofs mount, no WEKA agent in guest)
```
Hypervisor benchmarks run directly against /mnt/weka. VM benchmarks boot
a Rocky 8 guest via QEMU, mount via virtiofs, and run fio inside the guest.
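To make the wiring concrete, here is a minimal launch sketch. The real invocations live in the benchmark scripts; the socket path, guest memory size, disk image name, and the `weka` mount tag are illustrative assumptions, and binary paths vary by install:

```bash
# C virtiofsd exports the hypervisor's WEKA mount over a vhost-user socket
/usr/lib/qemu/virtiofsd --socket-path=/tmp/vhost-fs.sock \
    -o source=/mnt/weka -o cache=none &

# Guest RAM must be a shared file-backed memory object for vhost-user-fs;
# queue-size=1024 per the key rules below
qemu-system-x86_64 -enable-kvm -smp 8 -m 8G \
    -object memory-backend-file,id=mem,size=8G,mem-path=/dev/shm,share=on \
    -numa node,memdev=mem \
    -chardev socket,id=vfs0,path=/tmp/vhost-fs.sock \
    -device vhost-user-fs-pci,chardev=vfs0,tag=weka,queue-size=1024 \
    -drive file=rocky8.qcow2,if=virtio -nographic

# Inside the guest: mount by tag, no WEKA agent required
mount -t virtiofs weka /mnt/weka
```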
See OVERVIEW.md for the full recommendation, benchmark numbers, fleet orchestration patterns, and VM RAM sizing guidance.
Short version (C virtiofsd, cache=none + O_DIRECT, 6-backend GCP cluster, kernel 6.8.0):
| Configuration | Seq Write | Rand 4K Read | Latency |
|---|---|---|---|
| Hypervisor O_DIRECT (reference) | 646 MiB/s | 8.4 KiOPS | 965 µs |
| C virtiofsd cache=auto | 2,287 MiB/s | 39.3 KiOPS | 78 µs |
| Rust virtiofsd cache=auto | 2,013 MiB/s | 39.7 KiOPS | 84 µs |
| C virtiofsd cache=none + O_DIRECT | 3,650 MiB/s | 56.9 KiOPS | 78 µs |
| Rust virtiofsd cache=none + O_DIRECT | 3,350 MiB/s | 53.5 KiOPS | 85 µs |
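As a rough reproduction of those rows, jobs of this shape can be run inside the guest. The canonical job definitions live in `bench_suite.sh`; the depths, sizes, and runtimes below are illustrative guesses:

```bash
# Sequential write, O_DIRECT, against the virtiofs mount
fio --name=seqwrite --filename=/mnt/weka/fio.dat --rw=write --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=16 --size=10G \
    --runtime=60 --time_based --group_reporting

# Random 4K read, O_DIRECT
fio --name=randread --filename=/mnt/weka/fio.dat --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --size=10G \
    --runtime=60 --time_based --group_reporting
```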
PyTorch checkpoint overhead (virtiofsd vs hypervisor direct): ~35–40% slower saves, ~38–47% slower loads. A 10 GB LLaMA-7B shard: ~22 s via virtiofsd vs ~14 s direct. At typical checkpoint cadence (every 3–40 min) this is <4% of training time.
Key rules:

- Use `--cache=none` (C) / `--cache=never` (Rust) for AI training — bypasses the host page cache; reads go directly through the WEKA stack to the distributed NVMe backends (launch sketch below)
- Never use `--writeback` — confirmed 3.7× sequential write collapse
- Set `queue-size=1024` on the vhost-user-fs device — the default of 128 bottlenecks throughput
- C virtiofsd leads on kernel 6.8.0; both C and Rust are production-viable
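For the Rust daemon, the equivalent launch looks like this sketch, reusing the illustrative socket path from the QEMU example above (the binary path depends on how `make prereqs` installed it):

```bash
# Rust virtiofsd: --cache never corresponds to the C daemon's -o cache=none
/usr/libexec/virtiofsd --socket-path /tmp/vhost-fs.sock \
    --shared-dir /mnt/weka --cache never
```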