Reactive in-VM PgBouncer scaler + warm-start (and two pooler bug fixes)#1

Merged: jaredLunde merged 6 commits into main from pgbouncer-reactive-scaler on Jun 4, 2026

@jaredLunde

What this does

The single-threaded PgBouncer pooler is the one tier that can't self-scale (Postgres is process-per-connection and already memory-reactive). This PR makes the instance scale its own pooler from local signals: one worker when idle, up to vcpus/4 so_reuseport workers under connection-handshake load, reaped when load falls. The worker count survives every restart.

Empirical basis (bench/glidefs-pg/RESULTS.md)

Measured on a 12-vCPU QEMU rig and validated against the real product config in Docker (pgbouncer 1.25.2):

  • Multi-worker so_reuseport scales — pooler CPU linear (0.79→1.59 cores for 2 workers); throughput 1.5–1.8x/worker. A single pooler caps at ~2.4k TLS-churn conns/s/core.
  • TLS session resumption is dead as a pooler lever — TLS 1.3 keeps ECDHE on resume (1.0x) and standard pg clients never cache sessions. Cut.
  • Scale-down is graceful — SIGINT drains; 200 conns at 38k qps, kill one of two workers → 0.033% reconnect blip.
  • Real-config validation — 3 workers serve 90/90 scram+TLS queries, graceful reap.

Two pre-existing bugs found by that validation (fixed)

Neither was caught by CI: the supervisor tests only pgrep pgbouncer + check the role exists, never authenticate through :5432.

  1. setup_pgbouncer_auth never granted USAGE ON SCHEMA pgbouncer to the pgbouncer role → auth_query failed with "permission denied for schema pgbouncer" for every client.
  2. config::pgbouncer_ini omitted client_tls_ca_file for self-signed certs → pgbouncer 1.25.x FATALs at boot ("failed to load CA: (null)"), so the self-signed fallback never started the pooler.

Plus a pool-sizing fix: default_pool_size is now the full pool, not pool/max_workers (which under-provisioned the common single-worker box: 64-vCPU got 24 instead of 192).
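
The sizing change can be sketched as follows. The function names are hypothetical, and the divisor of 8 is only the value that reproduces the 64-vCPU figures quoted above (192 / 8 = 24); the PR does not show how the cap was derived on that box.

```rust
/// Old behavior as described in the PR: the pool was split across the
/// worker cap, even though the common box only ever runs one worker.
/// (Hypothetical name and signature.)
fn default_pool_size_old(full_pool: u32, max_workers: u32) -> u32 {
    // Guard against a zero cap and never hand out an empty pool.
    (full_pool / max_workers.max(1)).max(1)
}

/// New behavior: each worker gets the full pool as its ceiling.
/// Postgres max_connections stays the real backstop.
/// (Hypothetical name and signature.)
fn default_pool_size_new(full_pool: u32) -> u32 {
    full_pool
}
```

With a full pool of 192 and an assumed cap of 8, the old function yields 24 for a single-worker box while the new one yields 192.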

Warm-start across restarts

children.json is durable across reboots (GlideFS-backed rootfs), so cold boot restores the pre-restart worker count (clamped to current max_workers) instead of meeting the reconnect storm with one worker. Completes the story: handoff adopts live extras, snapshot resume preserves them, cold boot warm-starts.

Substrate-tuning wins (same measure-first campaign)

checkpoint_timeout raised 15→30min (frequent checkpoints wrote 2.9x more S3 at equal work); full-page writes (FPW) stay on (the CoW cache overwrites in place) and compression stays lz4 (zstd = +12.5% S3); ephemeral autovacuum throttle.

Testing

  • 122 unit tests (scaler decision, tick() orchestration incl. the worker-churn map, spawn/reap actuation, warm-start clamp); clippy -D warnings clean.
  • /proc sampling validated against a real busy process (0.993 cores).
  • Multi-worker real-config + graceful reap validated in Docker.
  • supervisor_warm_starts_pooler_after_restart e2e added (CI-only, like the 11 sibling lifecycle tests).
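
A minimal sketch of the /proc sampling math, assuming the usual Linux layout of /proc/<pid>/stat (utime and stime are fields 14 and 15) and a clock-tick rate of 100 per second, which is the common value but should really come from sysconf(_SC_CLK_TCK). Both function names are hypothetical; the PR's read_proc_cpu_ticks reads the file itself, while this version parses a line so it can run anywhere.

```rust
/// Parse utime + stime (in clock ticks) from one /proc/<pid>/stat line.
fn parse_cpu_ticks(stat_line: &str) -> Option<u64> {
    // comm (field 2) may contain spaces and parentheses, so split after
    // the LAST ')' rather than on whitespace from the start of the line.
    let rest = &stat_line[stat_line.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so field 14 is index 11.
    let utime: u64 = fields.get(11)?.parse().ok()?;
    let stime: u64 = fields.get(12)?.parse().ok()?;
    Some(utime + stime)
}

/// Convert a tick delta over a sampling window into "cores used".
fn cores_used(prev_ticks: u64, now_ticks: u64, elapsed_secs: f64, ticks_per_sec: f64) -> f64 {
    if elapsed_secs <= 0.0 || ticks_per_sec <= 0.0 {
        return 0.0;
    }
    // saturating_sub guards against a pid being reused with a lower count.
    now_ticks.saturating_sub(prev_ticks) as f64 / ticks_per_sec / elapsed_secs
}
```

A process that burns one full core for a 5-second window accumulates about 500 ticks at 100 ticks per second, which this math reports as roughly 1.0 cores.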

Deliberately deferred

  • The scaler's necessity is unproven until production pooler-CPU telemetry arrives (the pooler RPC command exists to answer exactly that). Thresholds ship as rig-derived defaults.
  • Session resumption: cut. PSI-based scaling in a pooler cgroup: noted for v2, gated on the telemetry.

🤖 Generated with Claude Code

jaredLunde and others added 6 commits June 4, 2026 12:25
The single-threaded PgBouncer pooler saturates one core terminating TLS
under connection churn (~2.4k conns/s/core, measured). It is the only tier
that can't self-scale: Postgres is process-per-connection and already
memory-reactive. Make the instance scale its own pooler from local signals,
tracking usage (1 pooler when idle, up to the cap only when genuinely
CPU-bound) so a scaled-to-zero box never carries peak-provisioned workers.

Empirically grounded (bench/glidefs-pg/RESULTS.md §F/F′/F″, 12-vCPU rig):
  - so_reuseport multi-worker scales: pooler CPU is linear (0.79→1.59 cores
    for 2 workers); throughput 1.5–1.8x/worker. Proven, not assumed.
  - TLS session resumption is dead as a pooler lever: TLS 1.3 keeps ECDHE on
    resume (1.0x) and standard pg clients never cache sessions. Cut.
  - Scale-up is zero-disruption; scale-down via SIGINT drains gracefully
    (measured 0.033% reconnect blip). So the scaler is safe to shed workers.

Implementation:
  - supervisor.rs: PgbScaler as an inline select! arm (5s tick) — samples
    per-worker /proc CPU, decides via a pure, unit-tested decide_scale()
    (asymmetric hysteresis: up fast >0.75 core ×2 ticks, down slow <0.5 ×6
    ticks, cooldowns), and grows/SIGINT-reaps the so_reuseport worker set
    within [1, pgbouncer_max_workers]. Reuses the existing pgb_extra
    plumbing (handoff adoption, graceful drain, crash-reap).
  - children.rs: read_proc_cpu_ticks() — utime+stime from /proc/<pid>/stat,
    same robust last-paren parse as read_starttime. Validated against a real
    busy process (0.993 cores for one pinned core).
  - rpc.rs + handoff_bridge.rs: `pooler` RPC command exposing PoolerStats
    (live/max workers, per-worker cores, at_ceiling) — the production signal
    for whether a real box ever saturates a pooler. Type lives in
    handoff_bridge (library) since SharedState carries the handle.
  - config.rs: pgbouncer_ini uses the FULL pool per worker (was pool/cap,
    which under-provisioned the common 1-worker box: 64-vCPU got 24 not 192).
    default_pool_size is a ceiling; max_connections is the real backstop.
  - pgbouncer.ini: empty unix_socket_dir so multiple so_reuseport workers
    don't collide on /tmp/.s.PGSQL.5432.
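
The asymmetric hysteresis above can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's decide_scale(): the signature, the state struct, the choice of per-worker average as the input, and the 3-tick cooldown are all hypothetical. Only the thresholds (0.75 and 0.5 cores), the streak lengths (2 and 6 ticks), and the [1, max_workers] bounds come from the commit message.

```rust
/// What the scaler should do on this tick.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScaleAction {
    Up,
    Down,
    Hold,
}

/// Streak and cooldown counters carried from one tick to the next.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Hysteresis {
    hot_ticks: u32,
    cold_ticks: u32,
    cooldown_ticks: u32,
}

const UP_CORES: f64 = 0.75; // from the commit message
const DOWN_CORES: f64 = 0.5; // from the commit message
const UP_TICKS: u32 = 2; // scale up fast
const DOWN_TICKS: u32 = 6; // scale down slowly
const COOLDOWN_TICKS: u32 = 3; // assumption: real cooldowns are not given

/// Pure decision step: takes the previous state, returns the next state
/// and an action. No I/O, so it is trivially unit-testable.
fn decide(
    state: Hysteresis,
    cores_per_worker: f64,
    live: usize,
    max_workers: usize,
) -> (Hysteresis, ScaleAction) {
    let mut s = state;
    // Update the streaks; a sample in the dead band resets both.
    if cores_per_worker > UP_CORES {
        s.hot_ticks = s.hot_ticks.saturating_add(1);
        s.cold_ticks = 0;
    } else if cores_per_worker < DOWN_CORES {
        s.cold_ticks = s.cold_ticks.saturating_add(1);
        s.hot_ticks = 0;
    } else {
        s.hot_ticks = 0;
        s.cold_ticks = 0;
    }
    // While cooling down, keep counting streaks but take no action.
    if s.cooldown_ticks > 0 {
        s.cooldown_ticks -= 1;
        return (s, ScaleAction::Hold);
    }
    let action = if s.hot_ticks >= UP_TICKS && live < max_workers {
        ScaleAction::Up
    } else if s.cold_ticks >= DOWN_TICKS && live > 1 {
        ScaleAction::Down
    } else {
        ScaleAction::Hold
    };
    if action != ScaleAction::Hold {
        // Reset the streaks and start a cooldown after every change.
        s = Hysteresis { cooldown_ticks: COOLDOWN_TICKS, ..Hysteresis::default() };
    }
    (s, action)
}
```

At a 5s tick, this shape scales up after about 10 seconds of sustained load and down only after about 30 seconds of sustained idleness, and it never drops below one worker.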

Substrate-tuning wins (same measure-first campaign, already validated):
  - 00-beyond.conf: checkpoint_timeout 15→30min (frequent checkpoints wrote
    2.9x more S3 at equal work); keep FPW on (CoW cache overwrites in place,
    the ZFS trick doesn't transfer) and lz4 (zstd = +12.5% S3).
  - boot.rs: thread ram_bytes/vcpus into pgbouncer_ini for box-scaled sizing.

Also adds the reproducible benchmark rig (bench/glidefs-pg/) and a
bench:substrate mise task.

Tests: 6 decide_scale unit tests + updated pgbouncer_scales_with_box; full
lib+bin suite green; clippy -D warnings clean. The Docker supervisor
lifecycle suite can't run on this host (constrained container: read-only
/proc/sys, no privileged syscalls) — confirmed by identical failure on clean
main — so it must be exercised in CI before merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

Close the test gaps in the reactive scaler by adding seams, not infra — the
earlier "needs QEMU" framing was wrong (QEMU only isolated the GlideFS
substrate experiments; the scaler is pgbouncer + /proc + a port).

  - tick_with(): inject the CPU sampler so the whole orchestration — the
    per-pid prev-ticks map (proves a worker appearing mid-window can't spike
    the aggregate), the cores math, telemetry population, and the decide
    integration — is deterministically testable. 4 tests.
  - apply_scale_action(): extract spawn/reap actuation behind injected
    closures; test name-ordering, the worker-name cap, spawn failure, and that
    Down SIGINTs the last worker without removing it (wait_any_child does that
    on exit). 4 tests.
  - read_proc_cpu_ticks: real-/proc tests (busy-loop increases ticks; missing
    pid errors) — the manual 0.993-core check is now a regression guard.
  - scaler_tick_detects_real_cpu_load_and_decides_up (#[ignore]): real-process
    integration — production tick() over real /proc of a real CPU-burner reads
    ~1 core and decides Up. Runs here (no pgbouncer/QEMU needed); verified.
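
The injected-sampler seam and the per-pid prev-ticks map could look roughly like this. It is a sketch, not the PR's tick_with(): the function name, signature, and the 100 ticks per second used in the example are assumptions. The point it illustrates is the one named above: a worker first seen in this window only records a baseline, so its lifetime tick count cannot spike the aggregate.

```rust
use std::collections::HashMap;

/// Sum CPU usage across workers for one sampling window.
/// `sample` is injected so tests can feed deterministic tick counts
/// instead of reading /proc.
fn aggregate_cores<F>(
    pids: &[u32],
    prev: &mut HashMap<u32, u64>,
    elapsed_secs: f64,
    ticks_per_sec: f64,
    mut sample: F,
) -> f64
where
    F: FnMut(u32) -> Option<u64>,
{
    let mut total_delta: u64 = 0;
    let mut next = HashMap::new();
    for &pid in pids {
        // A failed sample (process gone) is skipped, not treated as zero.
        let Some(now) = sample(pid) else { continue };
        // Only pids seen in the previous window contribute a delta.
        if let Some(&before) = prev.get(&pid) {
            total_delta += now.saturating_sub(before);
        }
        next.insert(pid, now);
    }
    // Replacing the map also forgets workers that have exited.
    *prev = next;
    if elapsed_secs <= 0.0 || ticks_per_sec <= 0.0 {
        return 0.0;
    }
    total_delta as f64 / ticks_per_sec / elapsed_secs
}
```

If pid 1 was at 1000 ticks and is now at 1500, and pid 2 appears for the first time with 90000 lifetime ticks, a 5-second window reports 1.0 cores, not the hundreds a naive sum would give.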

Remaining uncovered surface is narrow and characterized: spawn_pgbouncer
launching a real pgbouncer + so_reuseport CPU scaling + SIGINT drain — proven
empirically in bench/glidefs-pg §F/§F″, exercised automatically only by the
Docker supervisor lifecycle suite (CI; can't run on this host).

121 unit tests green; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocker)

Run the committed pgbouncer.ini + the config::pgbouncer_ini-appended lines
(scram auth_query, TLS, unix_socket_dir=, so_reuseport=1) + the exact
setup_pgbouncer_auth SQL as 3 so_reuseport workers in the beyond-pg-test image
(pgbouncer 1.25.2). Result: 3 workers share :5432, 90/90 scram+TLS queries
served across all three, SIGINT reaps one gracefully (60/60 still served, 0
errors). This is the multi-worker deliverable validated against the real config
in the real pgbouncer — the piece the unit tests and the trust-auth rig couldn't.

Surfaced two pre-existing, CI-invisible bugs (the supervisor tests only pgrep
pgbouncer + check the role exists, never authenticate through :5432):
  1. self-signed cert omits client_tls_ca_file → pgbouncer 1.25.2 fatals at
     startup ("failed to load CA: (null)").
  2. setup_pgbouncer_auth never GRANTs USAGE on schema pgbouncer to the
     pgbouncer role → auth_query fails "permission denied for schema pgbouncer"
     for every client.
Both documented in RESULTS.md §Round 4 and worked around in the script; neither
is fixed in product code here (separable from the scaler).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…S bugs

Warm-start (the restart-resilience gap): a busy box that cold-reboots/crashes
faces a reconnect storm — peak TLS-handshake churn — and the old cold-start path
met it with ONE pooler, then crawled back up under the scaler's up-cooldown
(~90s under-provisioned). children.json is durable across reboots (GlideFS-backed
rootfs), so it already records how many extra workers were live. Cold start now
reads that count and warm-starts the so_reuseport workers to the pre-restart size
(clamped to the current max_workers for resize-safety); the scaler reaps any
excess within minutes if load doesn't justify it. It's a hint, not source-of-
truth — losing it just falls back to one worker. Completes the restart story:
handoff adopts extras (live), cold boot warm-starts them (respawn), snapshot
resume preserves them (hypervisor). warm_start_extra_count() unit-tested incl.
the downsize-clamp case.
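
The clamp described above can be sketched in a few lines. The name warm_start_extra_count appears in the commit message, but this signature is an assumption: here the recorded count is an Option so that a missing or unreadable children.json maps to None.

```rust
/// How many EXTRA so_reuseport workers to spawn at cold start, beyond the
/// one base worker. `recorded_extra` is the hint from children.json;
/// None means the hint was lost. (Assumed signature.)
fn warm_start_extra_count(recorded_extra: Option<usize>, max_workers: usize) -> usize {
    // The base worker always exists, so at most max_workers - 1 extras.
    let max_extra = max_workers.saturating_sub(1);
    // A lost hint falls back to zero extras, i.e. a single worker.
    recorded_extra.unwrap_or(0).min(max_extra)
}
```

A box that recorded 5 extras and was then resized down to max_workers = 2 warm-starts with 1 extra; a box with no recorded hint starts with none.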

Two pre-existing bugs found by the real-config Docker validation, fixed:
  - setup_pgbouncer_auth now GRANTs USAGE ON SCHEMA pgbouncer to the pgbouncer
    role. Without it auth_query fails "permission denied for schema pgbouncer"
    for every client — the pooler couldn't authenticate anyone.
  - config::pgbouncer_ini emits client_tls_ca_file for self-signed certs (pointed
    at the cert itself, its own issuer). pgbouncer 1.25.x FATALs at startup
    without a CA when a client cert is set ("failed to load CA: (null)"), so the
    self-signed fallback never started the pooler. Test updated accordingly.
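
A hedged sketch of the TLS part of that fix. The three setting names are real PgBouncer options; the function name, its signature, and the example paths are hypothetical, and the real config::pgbouncer_ini emits many more lines than this.

```rust
/// Render the client-side TLS file settings for pgbouncer.ini.
/// `ca_path` is None for the self-signed fallback, in which case the
/// cert is used as its own CA (a self-signed cert is its own issuer).
fn client_tls_lines(cert_path: &str, key_path: &str, ca_path: Option<&str>) -> String {
    let ca = ca_path.unwrap_or(cert_path);
    format!(
        "client_tls_key_file = {key_path}\n\
         client_tls_cert_file = {cert_path}\n\
         client_tls_ca_file = {ca}\n"
    )
}
```

For a self-signed cert at a hypothetical /etc/tls/server.crt, the output always contains a client_tls_ca_file line pointing back at that same file, which is the line whose absence made pgbouncer 1.25.x exit at startup.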

Re-validated in the beyond-pg-test image: real config (now with both fixes)
serves 90/90 scram+TLS queries across 3 so_reuseport workers, graceful reap.
122 unit tests green; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ness helpers

Adds supervisor_warm_starts_pooler_after_restart: boot → seed children.json with
an extra worker (simulating a scaled-up box) → docker restart (cold start, same
container so the writable layer persists like a reboot) → assert the supervisor
warm-starts back to 2 pooler workers. New RunningContainer helpers: pgbouncer_count,
wait_pgbouncer_count, restart.

CI-only, like the 11 sibling supervisor lifecycle tests — the unprivileged harness
can't boot the supervisor on a host with a restrictive seccomp profile, and running
it with --privileged is unsafe on the shared homelab node (the supervisor's boot
reserves host hugepages and writes host sysctls — kernel state shared with
production). The warm-start LOGIC is unit-tested (warm_start_extra_count, incl. the
downsize clamp); this exercises the full boot→restart→restore wiring in CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pooler is no longer a single static process: it runs one worker when idle
and adds so_reuseport workers across cores under connection-handshake load, then
reaps them. The count survives restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit f7fd1f1 into main Jun 4, 2026
1 check passed
@jaredLunde jaredLunde deleted the pgbouncer-reactive-scaler branch June 4, 2026 21:23