From 737826b4d8c8d8b52ccca1e422f5aae4716bd587 Mon Sep 17 00:00:00 2001
From: Jared Lunde
Date: Thu, 4 Jun 2026 12:25:13 -0700
Subject: [PATCH 1/6] feat pgbouncer: reactive in-VM pooler scaler + telemetry
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The single-threaded PgBouncer pooler saturates one core terminating TLS
under connection churn (~2.4k conns/s/core, measured). It is the only tier
that can't self-scale: Postgres is process-per-connection and already
memory-reactive. Make the instance scale its own pooler from local signals,
tracking usage (1 pooler when idle, up to the cap only when genuinely
CPU-bound) so a scaled-to-zero box never carries peak-provisioned workers.

Empirically grounded (bench/glidefs-pg/RESULTS.md §F/F′/F″, 12-vCPU rig):
- so_reuseport multi-worker scales: pooler CPU is linear (0.79→1.59 cores
  for 2 workers); throughput 1.5–1.8x/worker. Proven, not assumed.
- TLS session resumption is dead as a pooler lever: TLS 1.3 keeps ECDHE on
  resume (1.0x) and standard pg clients never cache sessions. Cut.
- Scale-up is zero-disruption; scale-down via SIGINT drains gracefully
  (measured 0.033% reconnect blip). So the scaler is safe to shed workers.

Implementation:
- supervisor.rs: PgbScaler as an inline select! arm (5s tick) — samples
  per-worker /proc CPU, decides via a pure, unit-tested decide_scale()
  (asymmetric hysteresis: up fast >0.75 core ×2 ticks, down slow <0.5 ×6
  ticks, cooldowns), and grows/SIGINT-reaps the so_reuseport worker set
  within [1, pgbouncer_max_workers]. Reuses the existing pgb_extra plumbing
  (handoff adoption, graceful drain, crash-reap).
- children.rs: read_proc_cpu_ticks() — utime+stime from /proc/<pid>/stat,
  same robust last-paren parse as read_starttime. Validated against a real
  busy process (0.993 cores for one pinned core).
- rpc.rs + handoff_bridge.rs: `pooler` RPC command exposing PoolerStats
  (live/max workers, per-worker cores, at_ceiling) — the production signal
  for whether a real box ever saturates a pooler. Type lives in
  handoff_bridge (library) since SharedState carries the handle.
- config.rs: pgbouncer_ini uses the FULL pool per worker (was pool/cap,
  which under-provisioned the common 1-worker box: 64-vCPU got 24 not 192).
  default_pool_size is a ceiling; max_connections is the real backstop.
- pgbouncer.ini: empty unix_socket_dir so multiple so_reuseport workers
  don't collide on /tmp/.s.PGSQL.5432.

Substrate-tuning wins (same measure-first campaign, already validated):
- 00-beyond.conf: checkpoint_timeout 15→30min (frequent checkpoints wrote
  2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place,
  the ZFS trick doesn't transfer) and lz4 (zstd = +12.5% S3).
- boot.rs: thread ram_bytes/vcpus into pgbouncer_ini for box-scaled sizing.

Also adds the reproducible benchmark rig (bench/glidefs-pg/) and a
bench:substrate mise task.

Tests: 6 decide_scale unit tests + updated pgbouncer_scales_with_box; full
lib+bin suite green; clippy -D warnings clean. The Docker supervisor
lifecycle suite can't run on this host (constrained container: read-only
/proc/sys, no privileged syscalls) — confirmed by identical failure on
clean main — so it must be exercised in CI before merge.

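For reviewers, the asymmetric hysteresis described for decide_scale() can be sketched roughly as below. This is an illustration under stated assumptions, not the code in supervisor.rs: the type names (ScaleDecision, ScaleState), the single shared cooldown of 3 ticks, and the use of the busiest worker's cores as the input signal are all hypothetical. Only the thresholds (>0.75 core for 2 ticks, <0.5 core for 6 ticks) and the [1, max] bounds come from the message above.

```rust
// Sketch only. Names, the cooldown length, and the "busiest worker" input
// are assumptions for illustration, not the actual supervisor.rs code.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScaleDecision {
    Hold,
    Up,
    Down,
}

/// Rolling state carried from one scaler tick to the next (hypothetical).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct ScaleState {
    hot_ticks: u32,      // consecutive ticks above UP_CORES
    cold_ticks: u32,     // consecutive ticks below DOWN_CORES
    cooldown_ticks: u32, // ticks left before another change is allowed
}

const UP_CORES: f64 = 0.75; // from the commit message
const DOWN_CORES: f64 = 0.5; // from the commit message
const UP_TICKS: u32 = 2; // from the commit message
const DOWN_TICKS: u32 = 6; // from the commit message
const COOLDOWN_TICKS: u32 = 3; // assumed value, not from the source

/// Pure decision step: no I/O, returns the next state and the decision.
fn decide_scale(
    prev: ScaleState,
    busiest_cores: f64,
    live: usize,
    max_workers: usize,
) -> (ScaleState, ScaleDecision) {
    let mut next = prev;

    // Track streaks; any sample in the dead band resets both.
    if busiest_cores > UP_CORES {
        next.hot_ticks = next.hot_ticks.saturating_add(1);
        next.cold_ticks = 0;
    } else if busiest_cores < DOWN_CORES {
        next.cold_ticks = next.cold_ticks.saturating_add(1);
        next.hot_ticks = 0;
    } else {
        next.hot_ticks = 0;
        next.cold_ticks = 0;
    }

    // While cooling down, keep counting streaks but never act.
    if next.cooldown_ticks > 0 {
        next.cooldown_ticks -= 1;
        return (next, ScaleDecision::Hold);
    }

    let after_change = ScaleState {
        hot_ticks: 0,
        cold_ticks: 0,
        cooldown_ticks: COOLDOWN_TICKS,
    };

    // Up fast: a short hot streak is enough, bounded by the worker cap.
    if next.hot_ticks >= UP_TICKS && live < max_workers {
        return (after_change, ScaleDecision::Up);
    }
    // Down slow: a long cold streak is required, and never below 1 worker.
    if next.cold_ticks >= DOWN_TICKS && live > 1 {
        return (after_change, ScaleDecision::Down);
    }
    (next, ScaleDecision::Hold)
}
```

In this sketch, two consecutive samples above 0.75 cores with room under the cap produce Up, while shedding a worker needs six consecutive samples below 0.5 cores. Keeping the function free of I/O is what makes it straightforward to unit-test.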
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 bench/glidefs-pg/README.md                    |  79 ++++
 bench/glidefs-pg/RESULTS.md                   | 204 ++++++++
 bench/glidefs-pg/conf/a1-no-recycle.conf      |   3 +
 bench/glidefs-pg/conf/baseline.conf           |  43 ++
 bench/glidefs-pg/conf/c1-checkpoint30.conf    |   2 +
 bench/glidefs-pg/conf/c1-checkpoint60.conf    |   2 +
 bench/glidefs-pg/conf/c1-cp300s.conf          |   5 +
 bench/glidefs-pg/conf/c1-cp30s.conf           |   5 +
 bench/glidefs-pg/conf/c1-cp45s.conf           |   5 +
 bench/glidefs-pg/conf/c1-cp90s.conf           |   5 +
 bench/glidefs-pg/conf/c2-flushafter-0.conf    |   5 +
 bench/glidefs-pg/conf/c2-flushafter-2mb.conf  |   4 +
 bench/glidefs-pg/conf/c3-aggressive.conf      |   3 +
 bench/glidefs-pg/conf/c3-gentle.conf          |   3 +
 bench/glidefs-pg/conf/c3-off.conf             |   2 +
 bench/glidefs-pg/conf/d-maintenance-io.conf   |   2 +
 bench/glidefs-pg/conf/d1-zstd.conf            |   2 +
 .../glidefs-pg/conf/e1-autovac-throttled.conf |   6 +
 bench/glidefs-pg/delta.awk                    |  48 ++
 bench/glidefs-pg/exp-d2-coldfork.sh           | 102 ++++
 bench/glidefs-pg/exp-d2.sh                    |  91 ++++
 bench/glidefs-pg/exp-e-fork.sh                | 104 +++++
 bench/glidefs-pg/exp-e.sh                     |  92 ++++
 bench/glidefs-pg/exp-f-pgbouncer.sh           |  58 +++
 bench/glidefs-pg/exp-multiworker.sh           |  67 +++
 bench/glidefs-pg/exp-pooler-ceiling.sh        |  88 ++++
 .../out/20260603-141450-smoke/glidefs.toml    |  20 +
 .../out/20260603-141450-smoke/harness.conf    |  40 ++
 .../out/20260603-141450-smoke/postgres.log    |   3 +
 .../out/20260603-141623-smoke/glidefs.toml    |  20 +
 bench/glidefs-pg/qemu-bringup.sh              |  71 +++
 bench/glidefs-pg/qemu-runall.sh               |  50 ++
 bench/glidefs-pg/run.sh                       | 286 ++++++++++++
 mise.toml                                     |  18 +
 packer/files/pgbouncer/pgbouncer.ini          |  12 +-
 packer/files/postgresql/00-beyond.conf        |  12 +-
 src/boot.rs                                   |   5 +-
 src/children.rs                               |  26 ++
 src/config.rs                                 | 130 +++++-
 src/handoff_bridge.rs                         |  40 +-
 src/rpc.rs                                    |  43 +-
 src/supervisor.rs                             | 442 +++++++++++++++++-
 42 files changed, 2223 insertions(+), 25 deletions(-)
 create mode 100644 bench/glidefs-pg/README.md
 create mode 100644 bench/glidefs-pg/RESULTS.md
 create mode 100644 bench/glidefs-pg/conf/a1-no-recycle.conf
 create mode 100644 bench/glidefs-pg/conf/baseline.conf
 create mode 100644 bench/glidefs-pg/conf/c1-checkpoint30.conf
 create mode 100644 bench/glidefs-pg/conf/c1-checkpoint60.conf
 create mode 100644 bench/glidefs-pg/conf/c1-cp300s.conf
 create mode 100644 bench/glidefs-pg/conf/c1-cp30s.conf
 create mode 100644 bench/glidefs-pg/conf/c1-cp45s.conf
 create mode 100644 bench/glidefs-pg/conf/c1-cp90s.conf
 create mode 100644 bench/glidefs-pg/conf/c2-flushafter-0.conf
 create mode 100644 bench/glidefs-pg/conf/c2-flushafter-2mb.conf
 create mode 100644 bench/glidefs-pg/conf/c3-aggressive.conf
 create mode 100644 bench/glidefs-pg/conf/c3-gentle.conf
 create mode 100644 bench/glidefs-pg/conf/c3-off.conf
 create mode 100644 bench/glidefs-pg/conf/d-maintenance-io.conf
 create mode 100644 bench/glidefs-pg/conf/d1-zstd.conf
 create mode 100644 bench/glidefs-pg/conf/e1-autovac-throttled.conf
 create mode 100755 bench/glidefs-pg/delta.awk
 create mode 100644 bench/glidefs-pg/exp-d2-coldfork.sh
 create mode 100644 bench/glidefs-pg/exp-d2.sh
 create mode 100644 bench/glidefs-pg/exp-e-fork.sh
 create mode 100644 bench/glidefs-pg/exp-e.sh
 create mode 100644 bench/glidefs-pg/exp-f-pgbouncer.sh
 create mode 100644 bench/glidefs-pg/exp-multiworker.sh
 create mode 100644 bench/glidefs-pg/exp-pooler-ceiling.sh
 create mode 100644 bench/glidefs-pg/out/20260603-141450-smoke/glidefs.toml
 create mode 100644 bench/glidefs-pg/out/20260603-141450-smoke/harness.conf
 create mode 100644 bench/glidefs-pg/out/20260603-141450-smoke/postgres.log
 create mode 100644 bench/glidefs-pg/out/20260603-141623-smoke/glidefs.toml
 create mode 100755 bench/glidefs-pg/qemu-bringup.sh
 create mode 100755 bench/glidefs-pg/qemu-runall.sh
 create mode 100755 bench/glidefs-pg/run.sh

diff --git a/bench/glidefs-pg/README.md b/bench/glidefs-pg/README.md
new file mode 100644
index 0000000..b053c39
--- /dev/null
+++ b/bench/glidefs-pg/README.md
@@ -0,0 +1,79 @@
+# GlideFS × Postgres substrate-tuning harness
+
+Measure-first rig for `plans/wal-recycle-off-*.md`. It puts a real Postgres data
+dir on a real GlideFS block device backed by an in-memory / local-file / MinIO
+object store (never real S3), runs a fixed pgbench workload, and scrapes
+GlideFS's own per-export metrics before/after so each tuning knob is scored on
+**S3 write cost and coalescing**, not intuition.
+
+## Why this exists
+
+The S3 cost of running Postgres on GlideFS is _distinct 128 KiB blocks flushed
+per cycle_. Overwrite-before-flush coalesces for free. Every knob in the plan
+(checkpoints, `*_flush_after`, bgwriter, `wal_compression`, `wal_recycle`,
+autovacuum-on-CoW, `compaction_cooldown`) changes how many distinct blocks reach
+the object store. This rig measures that directly via
+`glidefs_s3_batches_written_total`, `glidefs_coalesce_ratio`,
+`glidefs_write_amplification`.
+
+## Prereqs (already present on the homelab box)
+
+- `glidefs` binary, `nbd` + `ublk_drv` kernel modules loaded
+- passwordless `sudo` (glidefs needs CAP_SYS_ADMIN for `/dev/nbdN`; mkfs/mount)
+- Postgres 18 client tools (`initdb`, `pg_ctl`, `pgbench`, `psql`)
+
+## Run
+
+```sh
+# one run (baseline)
+mise run bench:substrate
+
+# a single candidate knob
+bench/glidefs-pg/run.sh --conf bench/glidefs-pg/conf/c1-checkpoint60.conf --label c1-60
+
+# full A/B sweep (baseline 3x for the noise floor, then each overlay)
+bench/glidefs-pg/sweep.sh
+```
+
+Knobs: `--backend file|memory|minio` (default `file` — same byte accounting as
+`memory`, no RAM blowup on long runs), `--scale`, `--duration`, `--clients`,
+`--cooldown N` (GlideFS compaction_cooldown, the G experiment), `--transport
+nbd|ublk`, `--keep`.
+
+## Output
+
+`out/-