Reactive in-VM PgBouncer scaler + warm-start (and two pooler bug fixes) #1
Merged
The single-threaded PgBouncer pooler saturates one core terminating TLS
under connection churn (~2.4k conns/s/core, measured). It is the only tier
that can't self-scale: Postgres is process-per-connection and already
memory-reactive. Make the instance scale its own pooler from local signals,
tracking usage (1 pooler when idle, up to the cap only when genuinely
CPU-bound) so a scaled-to-zero box never carries peak-provisioned workers.
Empirically grounded (bench/glidefs-pg/RESULTS.md §F/F′/F″, 12-vCPU rig):
- so_reuseport multi-worker scales: pooler CPU is linear (0.79→1.59 cores
for 2 workers); throughput 1.5–1.8x/worker. Proven, not assumed.
- TLS session resumption is dead as a pooler lever: TLS 1.3 keeps ECDHE on
resume (1.0x) and standard pg clients never cache sessions. Cut.
- Scale-up is zero-disruption; scale-down via SIGINT drains gracefully
(measured 0.033% reconnect blip). So the scaler is safe to shed workers.
Implementation:
- supervisor.rs: PgbScaler as an inline select! arm (5s tick) — samples
per-worker /proc CPU, decides via a pure, unit-tested decide_scale()
(asymmetric hysteresis: up fast >0.75 core ×2 ticks, down slow <0.5 ×6
ticks, cooldowns), and grows/SIGINT-reaps the so_reuseport worker set
within [1, pgbouncer_max_workers]. Reuses the existing pgb_extra
plumbing (handoff adoption, graceful drain, crash-reap).
- children.rs: read_proc_cpu_ticks() — utime+stime from /proc/<pid>/stat,
same robust last-paren parse as read_starttime. Validated against a real
busy process (0.993 cores for one pinned core).
- rpc.rs + handoff_bridge.rs: `pooler` RPC command exposing PoolerStats
(live/max workers, per-worker cores, at_ceiling) — the production signal
for whether a real box ever saturates a pooler. Type lives in
handoff_bridge (library) since SharedState carries the handle.
- config.rs: pgbouncer_ini uses the FULL pool per worker (was pool/cap,
which under-provisioned the common 1-worker box: 64-vCPU got 24 not 192).
default_pool_size is a ceiling; max_connections is the real backstop.
- pgbouncer.ini: empty unix_socket_dir so multiple so_reuseport workers
don't collide on /tmp/.s.PGSQL.5432.
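A minimal sketch of the asymmetric hysteresis described for decide_scale(), not the code in supervisor.rs. The 0.75/0.5 thresholds and the 2-tick/6-tick streaks come from the text above; the type names, the function signature, the single shared cooldown counter, and the choice to compare a per-worker cores figure are all assumptions made for illustration.

```rust
/// What the scaler should do after one tick.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScaleAction {
    Up,
    Down,
    Hold,
}

/// Tunables. Thresholds and streak lengths follow the commit text;
/// `cooldown_ticks` is a placeholder (the real cooldowns are not shown).
#[derive(Debug, Clone, Copy)]
pub struct ScalePolicy {
    pub up_cores: f64,       // e.g. 0.75
    pub up_ticks: u32,       // e.g. 2
    pub down_cores: f64,     // e.g. 0.5
    pub down_ticks: u32,     // e.g. 6
    pub cooldown_ticks: u32, // hypothetical single cooldown for both directions
    pub max_workers: usize,
}

/// Hysteresis memory carried from one tick to the next (hypothetical layout).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ScaleState {
    pub hot_streak: u32,
    pub cold_streak: u32,
    pub cooldown_left: u32,
}

/// Pure decision step: no I/O, returns the next state and the action.
pub fn decide_scale(
    mut state: ScaleState,
    policy: &ScalePolicy,
    workers: usize,
    cores_per_worker: f64,
) -> (ScaleState, ScaleAction) {
    // A sample in the dead band between the thresholds resets both streaks.
    state.hot_streak = if cores_per_worker > policy.up_cores {
        state.hot_streak.saturating_add(1)
    } else {
        0
    };
    state.cold_streak = if cores_per_worker < policy.down_cores {
        state.cold_streak.saturating_add(1)
    } else {
        0
    };

    // While cooling down, keep counting streaks but never act.
    if state.cooldown_left > 0 {
        state.cooldown_left -= 1;
        return (state, ScaleAction::Hold);
    }

    // Up needs a short streak, Down a long one; both respect [1, max_workers].
    let action = if state.hot_streak >= policy.up_ticks && workers < policy.max_workers {
        ScaleAction::Up
    } else if state.cold_streak >= policy.down_ticks && workers > 1 {
        ScaleAction::Down
    } else {
        ScaleAction::Hold
    };

    if action != ScaleAction::Hold {
        state = ScaleState {
            hot_streak: 0,
            cold_streak: 0,
            cooldown_left: policy.cooldown_ticks,
        };
    }
    (state, action)
}
```

Returning the next state instead of mutating shared state keeps the function free of side effects, which is what makes table-style unit tests over tick sequences straightforward.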
Substrate-tuning wins (same measure-first campaign, already validated):
- 00-beyond.conf: checkpoint_timeout 15→30min (frequent checkpoints wrote
2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place,
the ZFS trick doesn't transfer) and lz4 (zstd = +12.5% S3).
- boot.rs: thread ram_bytes/vcpus into pgbouncer_ini for box-scaled sizing.
Also adds the reproducible benchmark rig (bench/glidefs-pg/) and a
bench:substrate mise task.
Tests: 6 decide_scale unit tests + updated pgbouncer_scales_with_box; full
lib+bin suite green; clippy -D warnings clean. The Docker supervisor
lifecycle suite can't run on this host (constrained container: read-only
/proc/sys, no privileged syscalls) — confirmed by identical failure on clean
main — so it must be exercised in CI before merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion
Close the test gaps in the reactive scaler by adding seams, not infra — the
earlier "needs QEMU" framing was wrong (QEMU only isolated the GlideFS
substrate experiments; the scaler is pgbouncer + /proc + a port).
- tick_with(): inject the CPU sampler so the whole orchestration — the
per-pid prev-ticks map (proves a worker appearing mid-window can't spike
the aggregate), the cores math, telemetry population, and the decide
integration — is deterministically testable. 4 tests.
- apply_scale_action(): extract spawn/reap actuation behind injected
closures; test name-ordering, the worker-name cap, spawn failure, and that
Down SIGINTs the last worker without removing it (wait_any_child does that
on exit). 4 tests.
- read_proc_cpu_ticks: real-/proc tests (busy-loop increases ticks; missing
pid errors) — the manual 0.993-core check is now a regression guard.
- scaler_tick_detects_real_cpu_load_and_decides_up (#[ignore]): real-process
integration — production tick() over real /proc of a real CPU-burner reads
~1 core and decides Up. Runs here (no pgbouncer/QEMU needed); verified.
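The sampling path covered by these tests can be sketched as follows, under stated assumptions. The last-paren split and the utime+stime sum follow the description of read_proc_cpu_ticks; the field positions are those documented for /proc/<pid>/stat. The names parse_cpu_ticks and CpuWindow, the signatures, and passing the tick rate as a parameter (commonly 100 on Linux, queried via sysconf in real code) are hypothetical, not the product code.

```rust
use std::collections::HashMap;

/// Parse utime + stime (in clock ticks) from one /proc/<pid>/stat line.
pub fn parse_cpu_ticks(stat: &str) -> Option<u64> {
    // comm (field 2) may itself contain spaces and ')', so split at the LAST ')'.
    let (_, rest) = stat.rsplit_once(')')?;
    let mut fields = rest.split_ascii_whitespace();
    // `rest` starts at field 3 (state), so utime (field 14) is index 11
    // and stime (field 15) is the element right after it.
    let utime: u64 = fields.nth(11)?.parse().ok()?;
    let stime: u64 = fields.next()?.parse().ok()?;
    utime.checked_add(stime)
}

/// Remembers each worker's previous cumulative tick count between scaler ticks.
#[derive(Debug, Default)]
pub struct CpuWindow {
    prev: HashMap<u32, u64>,
}

impl CpuWindow {
    /// Convert (pid, cumulative ticks) samples into cores used over the window.
    /// A pid with no previous sample reports 0.0: its lifetime total is not
    /// this window's usage, so it must not spike the aggregate.
    pub fn cores(
        &mut self,
        samples: &[(u32, u64)],
        elapsed_secs: f64,
        ticks_per_sec: f64,
    ) -> Vec<(u32, f64)> {
        let mut next = HashMap::with_capacity(samples.len());
        let mut out = Vec::with_capacity(samples.len());
        for &(pid, ticks) in samples {
            let cores = match self.prev.get(&pid) {
                Some(&before) if elapsed_secs > 0.0 => {
                    ticks.saturating_sub(before) as f64 / (ticks_per_sec * elapsed_secs)
                }
                _ => 0.0, // first sighting (or zero-length window): no baseline yet
            };
            next.insert(pid, ticks);
            out.push((pid, cores));
        }
        // Replacing the map also forgets workers that have exited.
        self.prev = next;
        out
    }
}
```

Feeding samples in as plain (pid, ticks) pairs is one way to get the injected-sampler seam: a test can supply synthetic tick counts without touching a real /proc.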
Remaining uncovered surface is narrow and characterized: spawn_pgbouncer
launching a real pgbouncer + so_reuseport CPU scaling + SIGINT drain — proven
empirically in bench/glidefs-pg §F/§F″, exercised automatically only by the
Docker supervisor lifecycle suite (CI; can't run on this host).
121 unit tests green; clippy -D warnings clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocker)
Run the committed pgbouncer.ini + the config::pgbouncer_ini-appended lines
(scram auth_query, TLS, unix_socket_dir=, so_reuseport=1) + the exact
setup_pgbouncer_auth SQL as 3 so_reuseport workers in the beyond-pg-test image
(pgbouncer 1.25.2). Result: 3 workers share :5432, 90/90 scram+TLS queries
served across all three, SIGINT reaps one gracefully (60/60 still served, 0
errors). This is the multi-worker deliverable validated against the real config
in the real pgbouncer — the piece the unit tests and the trust-auth rig couldn't.
Surfaced two pre-existing, CI-invisible bugs (the supervisor tests only pgrep
pgbouncer + check the role exists, never authenticate through :5432):
1. the self-signed-cert config omits client_tls_ca_file → pgbouncer 1.25.2
fatals at startup ("failed to load CA: (null)").
2. setup_pgbouncer_auth never GRANTs USAGE on schema pgbouncer to the
pgbouncer role → auth_query fails "permission denied for schema pgbouncer"
for every client.
Both documented in RESULTS.md §Round 4 and worked around in the script; neither
is fixed in product code here (separable from the scaler).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…S bugs
Warm-start (the restart-resilience gap): a busy box that cold-reboots/crashes
faces a reconnect storm — peak TLS-handshake churn — and the old cold-start path
met it with ONE pooler, then crawled back up under the scaler's up-cooldown
(~90s under-provisioned). children.json is durable across reboots (GlideFS-backed
rootfs), so it already records how many extra workers were live. Cold start now
reads that count and warm-starts the so_reuseport workers to the pre-restart size
(clamped to the current max_workers for resize-safety); the scaler reaps any
excess within minutes if load doesn't justify it. It's a hint, not source-of-
truth — losing it just falls back to one worker. Completes the restart story:
handoff adopts extras (live), cold boot warm-starts them (respawn), snapshot
resume preserves them (hypervisor). warm_start_extra_count() unit-tested incl.
the downsize-clamp case.
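The clamp described above is small enough to sketch. This is an illustration only: the function name matches the one mentioned in the text, but the signature and the Option input (None standing for a missing or unreadable children.json) are assumptions.

```rust
/// How many extra so_reuseport workers to respawn on cold start.
///
/// `recorded_extra` is the count recovered from the durable state file, or
/// `None` when the hint is lost. The primary worker always runs, so the
/// extras can never exceed `max_workers - 1`.
pub fn warm_start_extra_count(recorded_extra: Option<usize>, max_workers: usize) -> usize {
    // saturating_sub guards a misconfigured max_workers of 0.
    let ceiling = max_workers.saturating_sub(1);
    // A lost hint means zero extras, i.e. the old one-worker cold start.
    recorded_extra.unwrap_or(0).min(ceiling)
}
```

For example, a box that recorded 5 extras and was then resized so that max_workers is 2 would warm-start 1 extra worker, and a box with no recorded count would start with none.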
Two pre-existing bugs found by the real-config Docker validation, fixed:
- setup_pgbouncer_auth now GRANTs USAGE ON SCHEMA pgbouncer to the pgbouncer
role. Without it auth_query fails "permission denied for schema pgbouncer"
for every client — the pooler couldn't authenticate anyone.
- config::pgbouncer_ini emits client_tls_ca_file for self-signed certs (pointed
at the cert itself, its own issuer). pgbouncer 1.25.x FATALs at startup
without a CA when a client cert is set ("failed to load CA: (null)"), so the
self-signed fallback never started the pooler. Test updated accordingly.
Re-validated in the beyond-pg-test image: real config (now with both fixes)
serves 90/90 scram+TLS queries across 3 so_reuseport workers, graceful reap.
122 unit tests green; clippy -D warnings clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ness helpers
Adds supervisor_warm_starts_pooler_after_restart: boot → seed children.json
with an extra worker (simulating a scaled-up box) → docker restart (cold
start, same container so the writable layer persists like a reboot) → assert
the supervisor warm-starts back to 2 pooler workers. New RunningContainer
helpers: pgbouncer_count, wait_pgbouncer_count, restart.
CI-only, like the 11 sibling supervisor lifecycle tests: the unprivileged
harness can't boot the supervisor on a host with a restrictive seccomp
profile, and running it with --privileged is unsafe on the shared homelab
node (the supervisor's boot reserves host hugepages and writes host sysctls,
which are kernel state shared with production). The warm-start LOGIC is
unit-tested (warm_start_extra_count, incl. the downsize clamp); this
exercises the full boot→restart→restore wiring in CI.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pooler is no longer a single static process: it runs one worker when idle
and adds so_reuseport workers across cores under connection-handshake load,
then reaps them. The count survives restarts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this does
The single-threaded PgBouncer pooler is the one tier that can't self-scale (Postgres is process-per-connection and already memory-reactive). This makes the instance scale its own pooler from local signals: one worker when idle, up to `vcpus/4` `so_reuseport` workers under connection-handshake load, reaped when it falls. The worker count survives every restart.

Empirical basis (`bench/glidefs-pg/RESULTS.md`)
Measured on a 12-vCPU QEMU rig and validated against the real product config in Docker (pgbouncer 1.25.2):
- `so_reuseport` scales: pooler CPU linear (0.79→1.59 cores for 2 workers); throughput 1.5–1.8x/worker. A single pooler caps at ~2.4k TLS-churn conns/s/core.

Two pre-existing bugs found by that validation (fixed)
Neither was caught by CI: the supervisor tests only `pgrep pgbouncer` + check the role exists, never authenticate through `:5432`.
- `setup_pgbouncer_auth` never granted `USAGE ON SCHEMA pgbouncer` to the `pgbouncer` role → `auth_query` failed `permission denied for schema pgbouncer` for every client.
- `config::pgbouncer_ini` omitted `client_tls_ca_file` for self-signed certs → pgbouncer 1.25.x FATALs at boot (`failed to load CA: (null)`), so the self-signed fallback never started the pooler.
Plus a pool-sizing fix: `default_pool_size` is now the full pool, not `pool/max_workers` (which under-provisioned the common single-worker box: 64-vCPU got 24 instead of 192).

Warm-start across restarts
`children.json` is durable across reboots (GlideFS-backed rootfs), so cold boot restores the pre-restart worker count (clamped to current `max_workers`) instead of meeting the reconnect storm with one worker. Completes the story: handoff adopts live extras, snapshot resume preserves them, cold boot warm-starts.

Substrate-tuning wins (same measure-first campaign)
- `checkpoint_timeout` 15→30min (frequent checkpoints wrote 2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place) and lz4 (zstd = +12.5% S3); ephemeral autovacuum throttle.

Testing
- … `tick()` orchestration incl. the worker-churn map, spawn/reap actuation, warm-start clamp); clippy `-D warnings` clean.
- `/proc` sampling validated against a real busy process (0.993 cores).
- `supervisor_warm_starts_pooler_after_restart` e2e added (CI-only, like the 11 sibling lifecycle tests).

Deliberately deferred
- … `pooler` RPC command exists to answer exactly that). Thresholds ship as rig-derived defaults.

🤖 Generated with Claude Code