Reactive in-VM PgBouncer scaler + warm-start (and two pooler bug fixes) by jaredLunde · Pull Request #1 · beyondoss/postgres

jaredLunde · 2026-06-04T21:10:31Z

What this does

The single-threaded PgBouncer pooler is the one tier that can't self-scale (Postgres is process-per-connection and already memory-reactive). This PR makes the instance scale its own pooler from local signals: one worker when idle, up to vcpus/4 so_reuseport workers under connection-handshake load, with extra workers reaped when load falls. The worker count survives every restart.

Empirical basis (`bench/glidefs-pg/RESULTS.md`)

Measured on a 12-vCPU QEMU rig and validated against the real product config in Docker (pgbouncer 1.25.2):

- Multi-worker so_reuseport scales: pooler CPU is linear (0.79→1.59 cores for 2 workers) and throughput is 1.5–1.8x/worker. A single pooler caps at ~2.4k TLS-churn conns/s/core.
- TLS session resumption is dead as a pooler lever: TLS 1.3 keeps ECDHE on resume (1.0x) and standard pg clients never cache sessions. Cut.
- Scale-down is graceful: SIGINT drains. With 200 conns at 38k qps, killing one of two workers caused a 0.033% reconnect blip.
- Real-config validation: 3 workers serve 90/90 scram+TLS queries, with graceful reap.

Two pre-existing bugs found by that validation (fixed)

Neither was caught by CI: the supervisor tests only pgrep pgbouncer + check the role exists, never authenticate through :5432.

- setup_pgbouncer_auth never granted USAGE ON SCHEMA pgbouncer to the pgbouncer role, so auth_query failed with "permission denied for schema pgbouncer" for every client.
- config::pgbouncer_ini omitted client_tls_ca_file for self-signed certs, so pgbouncer 1.25.x FATALs at boot ("failed to load CA: (null)") and the self-signed fallback never started the pooler.

Plus a pool-sizing fix: default_pool_size is now the full pool, not pool/max_workers (which under-provisioned the common single-worker box: 64-vCPU got 24 instead of 192).

Warm-start across restarts

children.json is durable across reboots (GlideFS-backed rootfs), so cold boot restores the pre-restart worker count (clamped to current max_workers) instead of meeting the reconnect storm with one worker. Completes the story: handoff adopts live extras, snapshot resume preserves them, cold boot warm-starts.
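The clamp described above can be sketched as a small pure function. This is an illustration only: the name `warm_start_extras_sketch`, its signature, and the use of `Option` for a missing record are assumptions, not the signature of the real warm_start_extra_count().

```rust
/// Hypothetical sketch of the warm-start clamp (not the project's real function).
///
/// `recorded_extras`: extra workers recorded before the restart, or `None` when
/// the durable record is missing or unreadable (assumed representation).
/// `max_workers`: current cap on total pooler workers, base worker included.
fn warm_start_extras_sketch(recorded_extras: Option<usize>, max_workers: usize) -> usize {
    // One base worker always runs, so at most `max_workers - 1` extras fit.
    let room = max_workers.saturating_sub(1);
    // A lost record is only a lost hint: fall back to zero extras (one worker).
    recorded_extras.unwrap_or(0).min(room)
}
```

Under these assumptions, a box that recorded 5 extras and was then resized down to a cap of 3 would respawn only 2 extras, and a box with no record would start with the single base worker.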

Substrate-tuning wins (same measure-first campaign)

checkpoint_timeout 15→30min (frequent checkpoints wrote 2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place) and lz4 (zstd = +12.5% S3); ephemeral autovacuum throttle.

Testing

- 122 unit tests (scaler decision, tick() orchestration incl. the worker-churn map, spawn/reap actuation, warm-start clamp); clippy -D warnings clean.
- /proc sampling validated against a real busy process (0.993 cores).
- Multi-worker real-config + graceful reap validated in Docker.
- supervisor_warm_starts_pooler_after_restart e2e added (CI-only, like the 11 sibling lifecycle tests).
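For readers unfamiliar with the /proc sampling mentioned in the list above, here is a minimal sketch of reading utime+stime from /proc/<pid>/stat and converting a tick delta into cores. The function names are hypothetical and the code is not taken from children.rs. It relies on the documented layout of the stat file (utime and stime are fields 14 and 15) and takes the clock-tick rate as a parameter, since the sketch does not query `sysconf(_SC_CLK_TCK)` itself.

```rust
use std::fs;

/// Parse utime + stime (in clock ticks) from the contents of /proc/<pid>/stat.
/// The comm field (field 2) is parenthesised and may itself contain spaces or
/// ')' characters, so we split after the LAST ')' rather than on whitespace.
fn parse_cpu_ticks(stat: &str) -> Option<u64> {
    let close = stat.rfind(')')?;
    // The first field after comm is field 3 (state), so field N sits at index N - 3.
    let mut fields = stat[close + 1..].split_whitespace();
    let utime: u64 = fields.nth(11)?.parse().ok()?; // field 14
    let stime: u64 = fields.next()?.parse().ok()?; // field 15
    Some(utime + stime)
}

/// Read the tick counter for a live process; `None` if the pid is gone or unparsable.
fn read_cpu_ticks(pid: u32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    parse_cpu_ticks(&stat)
}

/// Convert a tick delta over a sampling window into "cores used".
/// `ticks_per_sec` is the kernel clock-tick rate, supplied by the caller.
fn cores_used(delta_ticks: u64, ticks_per_sec: u64, elapsed_secs: f64) -> f64 {
    delta_ticks as f64 / ticks_per_sec as f64 / elapsed_secs
}
```

With a tick rate of 100 per second, a process that accumulates 500 ticks over a 5-second window has used one full core.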

Deliberately deferred

- The scaler's necessity is unproven until production pooler-CPU telemetry arrives (the pooler RPC command exists to answer exactly that). Thresholds ship as rig-derived defaults.
- Session resumption: cut. PSI-based scaling in a pooler cgroup: noted for v2, gated on the telemetry.

🤖 Generated with Claude Code

The single-threaded PgBouncer pooler saturates one core terminating TLS under connection churn (~2.4k conns/s/core, measured). It is the only tier that can't self-scale: Postgres is process-per-connection and already memory-reactive. Make the instance scale its own pooler from local signals, tracking usage (1 pooler when idle, up to the cap only when genuinely CPU-bound) so a scaled-to-zero box never carries peak-provisioned workers.

Empirically grounded (bench/glidefs-pg/RESULTS.md §F/F′/F″, 12-vCPU rig):

- so_reuseport multi-worker scales: pooler CPU is linear (0.79→1.59 cores for 2 workers); throughput 1.5–1.8x/worker. Proven, not assumed.
- TLS session resumption is dead as a pooler lever: TLS 1.3 keeps ECDHE on resume (1.0x) and standard pg clients never cache sessions. Cut.
- Scale-up is zero-disruption; scale-down via SIGINT drains gracefully (measured 0.033% reconnect blip). So the scaler is safe to shed workers.

Implementation:

- supervisor.rs: PgbScaler as an inline select! arm (5s tick). It samples per-worker /proc CPU, decides via a pure, unit-tested decide_scale() (asymmetric hysteresis: up fast >0.75 core ×2 ticks, down slow <0.5 ×6 ticks, cooldowns), and grows/SIGINT-reaps the so_reuseport worker set within [1, pgbouncer_max_workers]. Reuses the existing pgb_extra plumbing (handoff adoption, graceful drain, crash-reap).
- children.rs: read_proc_cpu_ticks() reads utime+stime from /proc/<pid>/stat, with the same robust last-paren parse as read_starttime. Validated against a real busy process (0.993 cores for one pinned core).
- rpc.rs + handoff_bridge.rs: `pooler` RPC command exposing PoolerStats (live/max workers, per-worker cores, at_ceiling), the production signal for whether a real box ever saturates a pooler. Type lives in handoff_bridge (library) since SharedState carries the handle.
- config.rs: pgbouncer_ini uses the FULL pool per worker (was pool/cap, which under-provisioned the common 1-worker box: 64-vCPU got 24 not 192). default_pool_size is a ceiling; max_connections is the real backstop.
- pgbouncer.ini: empty unix_socket_dir so multiple so_reuseport workers don't collide on /tmp/.s.PGSQL.5432.

Substrate-tuning wins (same measure-first campaign, already validated):

- 00-beyond.conf: checkpoint_timeout 15→30min (frequent checkpoints wrote 2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place, the ZFS trick doesn't transfer) and lz4 (zstd = +12.5% S3).
- boot.rs: thread ram_bytes/vcpus into pgbouncer_ini for box-scaled sizing.

Also adds the reproducible benchmark rig (bench/glidefs-pg/) and a bench:substrate mise task.

Tests: 6 decide_scale unit tests + updated pgbouncer_scales_with_box; full lib+bin suite green; clippy -D warnings clean. The Docker supervisor lifecycle suite can't run on this host (constrained container: read-only /proc/sys, no privileged syscalls), confirmed by identical failure on clean main, so it must be exercised in CI before merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
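The asymmetric hysteresis in the commit message above (scale up after 2 ticks over 0.75 core, scale down after 6 ticks under 0.5 core, then a cooldown) can be sketched as a pure function. This is not the project's decide_scale(): every type and name below is hypothetical, the input is assumed to be a single cores-per-worker figure, and one shared cooldown length stands in for whatever cooldowns the real code uses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScaleAction { Up, Down, Hold }

/// Consecutive-tick counters carried between ticks (hypothetical state shape).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct Streaks { hot: u32, cold: u32, cooldown: u32 }

/// Tunables; the cooldown length is an assumed parameter, not a value from the PR.
struct Thresholds {
    up_cores: f64,
    up_ticks: u32,
    down_cores: f64,
    down_ticks: u32,
    cooldown_ticks: u32,
}

/// Pure decision step: takes the previous streaks, returns the action and new streaks.
fn decide_scale_sketch(
    prev: Streaks,
    t: &Thresholds,
    cores_per_worker: f64,
    live: usize,
    max_workers: usize,
) -> (ScaleAction, Streaks) {
    let mut s = prev;
    // Track consecutive hot/cold ticks; a reading in the dead band resets both.
    if cores_per_worker > t.up_cores {
        s.hot = s.hot.saturating_add(1);
        s.cold = 0;
    } else if cores_per_worker < t.down_cores {
        s.cold = s.cold.saturating_add(1);
        s.hot = 0;
    } else {
        s.hot = 0;
        s.cold = 0;
    }
    // While cooling down, keep observing but never act.
    if s.cooldown > 0 {
        s.cooldown -= 1;
        return (ScaleAction::Hold, s);
    }
    let fired = Streaks { hot: 0, cold: 0, cooldown: t.cooldown_ticks };
    // Up fires after a short hot streak, bounded by the worker cap.
    if s.hot >= t.up_ticks && live < max_workers {
        return (ScaleAction::Up, fired);
    }
    // Down needs a longer cold streak and never drops below one worker.
    if s.cold >= t.down_ticks && live > 1 {
        return (ScaleAction::Down, fired);
    }
    (ScaleAction::Hold, s)
}
```

Keeping the step pure (state in, state out) is what makes this kind of logic unit-testable without a running pooler. With the 5s tick named above, the two thresholds correspond to roughly 10 seconds of sustained load before growing and 30 seconds of sustained idleness before shrinking.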

…tion

Close the test gaps in the reactive scaler by adding seams, not infra. The earlier "needs QEMU" framing was wrong (QEMU only isolated the GlideFS substrate experiments; the scaler is pgbouncer + /proc + a port).

- tick_with(): inject the CPU sampler so the whole orchestration is deterministically testable: the per-pid prev-ticks map (proves a worker appearing mid-window can't spike the aggregate), the cores math, telemetry population, and the decide integration. 4 tests.
- apply_scale_action(): extract spawn/reap actuation behind injected closures; test name-ordering, the worker-name cap, spawn failure, and that Down SIGINTs the last worker without removing it (wait_any_child does that on exit). 4 tests.
- read_proc_cpu_ticks: real-/proc tests (busy-loop increases ticks; missing pid errors). The manual 0.993-core check is now a regression guard.
- scaler_tick_detects_real_cpu_load_and_decides_up (#[ignore]): real-process integration. Production tick() over real /proc of a real CPU-burner reads ~1 core and decides Up. Runs here (no pgbouncer/QEMU needed); verified.

Remaining uncovered surface is narrow and characterized: spawn_pgbouncer launching a real pgbouncer + so_reuseport CPU scaling + SIGINT drain. These are proven empirically in bench/glidefs-pg §F/§F″ and exercised automatically only by the Docker supervisor lifecycle suite (CI; can't run on this host).

121 unit tests green; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
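The per-pid prev-ticks map mentioned in the commit message above guards against a specific failure: a worker that appears mid-window already carries a large lifetime tick count, and charging all of it to the current window would look like a CPU spike. A minimal sketch of that idea follows. The function name and data shapes are hypothetical, and the samples are passed in as plain `(pid, ticks)` pairs, in the spirit of the injected sampler the commit describes.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: turn one round of (pid, cumulative_ticks) samples into
/// per-worker cores, using the previous round's ticks as the baseline.
fn per_worker_cores(
    prev: &mut HashMap<u32, u64>,
    current: &[(u32, u64)],
    ticks_per_sec: u64,
    elapsed_secs: f64,
) -> Vec<(u32, f64)> {
    let mut out = Vec::with_capacity(current.len());
    for &(pid, ticks) in current {
        let cores = match prev.get(&pid) {
            // Known worker: charge only the ticks accrued since the last sample.
            Some(&before) => {
                ticks.saturating_sub(before) as f64 / ticks_per_sec as f64 / elapsed_secs
            }
            // First sighting: no baseline yet, so contribute nothing this window.
            None => 0.0,
        };
        out.push((pid, cores));
    }
    // Forget workers that exited and record baselines for the next tick.
    prev.clear();
    prev.extend(current.iter().copied());
    out
}
```

In this sketch a newly spawned worker reports 0.0 cores for its first window and real usage from the second window on, so the aggregate can only under-count briefly and never spikes.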

…ocker)

Run the committed pgbouncer.ini + the config::pgbouncer_ini-appended lines (scram auth_query, TLS, unix_socket_dir=, so_reuseport=1) + the exact setup_pgbouncer_auth SQL as 3 so_reuseport workers in the beyond-pg-test image (pgbouncer 1.25.2).

Result: 3 workers share :5432, 90/90 scram+TLS queries served across all three, SIGINT reaps one gracefully (60/60 still served, 0 errors). This is the multi-worker deliverable validated against the real config in the real pgbouncer, the piece the unit tests and the trust-auth rig couldn't cover.

Surfaced two pre-existing, CI-invisible bugs (the supervisor tests only pgrep pgbouncer + check the role exists, never authenticate through :5432):

1. self-signed cert omits client_tls_ca_file → pgbouncer 1.25.2 fatals at startup ("failed to load CA: (null)").
2. setup_pgbouncer_auth never GRANTs USAGE on schema pgbouncer to the pgbouncer role → auth_query fails "permission denied for schema pgbouncer" for every client.

Both documented in RESULTS.md §Round 4 and worked around in the script; neither is fixed in product code here (separable from the scaler).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…S bugs

Warm-start (the restart-resilience gap): a busy box that cold-reboots/crashes faces a reconnect storm (peak TLS-handshake churn), and the old cold-start path met it with ONE pooler, then crawled back up under the scaler's up-cooldown (~90s under-provisioned). children.json is durable across reboots (GlideFS-backed rootfs), so it already records how many extra workers were live. Cold start now reads that count and warm-starts the so_reuseport workers to the pre-restart size (clamped to the current max_workers for resize-safety); the scaler reaps any excess within minutes if load doesn't justify it. It's a hint, not source-of-truth: losing it just falls back to one worker.

Completes the restart story: handoff adopts extras (live), cold boot warm-starts them (respawn), snapshot resume preserves them (hypervisor). warm_start_extra_count() unit-tested incl. the downsize-clamp case.

Two pre-existing bugs found by the real-config Docker validation, fixed:

- setup_pgbouncer_auth now GRANTs USAGE ON SCHEMA pgbouncer to the pgbouncer role. Without it auth_query fails "permission denied for schema pgbouncer" for every client, so the pooler couldn't authenticate anyone.
- config::pgbouncer_ini emits client_tls_ca_file for self-signed certs (pointed at the cert itself, its own issuer). pgbouncer 1.25.x FATALs at startup without a CA when a client cert is set ("failed to load CA: (null)"), so the self-signed fallback never started the pooler. Test updated accordingly.

Re-validated in the beyond-pg-test image: real config (now with both fixes) serves 90/90 scram+TLS queries across 3 so_reuseport workers, graceful reap.

122 unit tests green; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ness helpers

Adds supervisor_warm_starts_pooler_after_restart: boot → seed children.json with an extra worker (simulating a scaled-up box) → docker restart (cold start, same container so the writable layer persists like a reboot) → assert the supervisor warm-starts back to 2 pooler workers. New RunningContainer helpers: pgbouncer_count, wait_pgbouncer_count, restart.

CI-only, like the 11 sibling supervisor lifecycle tests. The unprivileged harness can't boot the supervisor on a host with a restrictive seccomp profile, and running it with --privileged is unsafe on the shared homelab node (the supervisor's boot reserves host hugepages and writes host sysctls, which is kernel state shared with production). The warm-start LOGIC is unit-tested (warm_start_extra_count, incl. the downsize clamp); this exercises the full boot→restart→restore wiring in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The pooler is no longer a single static process: it runs one worker when idle and adds so_reuseport workers across cores under connection-handshake load, then reaps them. The count survives restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jaredLunde and others added 6 commits June 4, 2026 12:25

jaredLunde merged commit f7fd1f1 into main Jun 4, 2026
1 check passed

jaredLunde deleted the pgbouncer-reactive-scaler branch June 4, 2026 21:23

jaredLunde mentioned this pull request Jun 4, 2026

chore(bench): drop committed rig run-artifacts; fix gitignore path #2

Merged
