fix(ublk): survive buffer-pool mmap OOM at worker init instead of aborting by jaredLunde · Pull Request #70 · beyondoss/glidefs

jaredLunde · 2026-06-05T06:48:58Z

Problem

ERROR glidefs: panic hook fired — worker buffer pool mmap failed — OOM at init:
  Os { code: 12, kind: OutOfMemory, "Cannot allocate memory" }
  at glidefs/src/block/ublk/buffer_pool.rs:257  thread="ublk-worker-6"
...
memory allocation of 184 bytes failed     → abort()

The per-worker USER_COPY bounce pool mmaps and pre-faults ~32 MB at worker init (256 slots × 128 KB per worker; with 16 workers that is 512 MB committed in a startup burst), on top of recovering ~66 exports' caches. A transient host memory spike made one mmap return ENOMEM; the init path did .expect(...) → panic → the allocator then failed even 184-byte allocations → abort(). systemd's on-failure restart re-slammed the same 512 MB burst into a tight host every 10 s, re-triggering the OOM: a self-amplifying crash loop that only escaped when host pressure happened to ease.
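The sizing arithmetic above can be checked with a few constants. This is only a sketch: the constant and function names are hypothetical (not taken from buffer_pool.rs), and "128 KB" is assumed to mean 128 × 1024 bytes.

```rust
// Figures quoted in the Problem section; names are illustrative only.
const WORKERS: u64 = 16;
const SLOTS_PER_WORKER: u64 = 256;
const SLOT_BYTES: u64 = 128 * 1024; // assumes 128 KB means 128 KiB

/// Bytes one worker's pool mmaps and pre-faults at init.
const fn per_worker_bytes() -> u64 {
    SLOTS_PER_WORKER * SLOT_BYTES
}

/// Bytes committed across all workers in the startup burst.
const fn total_burst_bytes() -> u64 {
    WORKERS * per_worker_bytes()
}
```

Under these assumptions `per_worker_bytes()` is 32 MiB and `total_burst_bytes()` is 512 MiB, matching the numbers in the description.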

Fix

The pool is a performance optimization (a bounded-RSS bounce buffer). The correct degraded behavior when it can't be mmapped is a heap buffer, not a dead daemon.

- worker_pool() returns Option and never panics. Success is cached; failure is deliberately not cached, so a later I/O retries the mmap and the worker upgrades back to the bounded fast path once the host recovers. Logging is throttled to one line per degrade→recover transition (not per I/O while starved).
- New IoBuf enum (pooled slot or heap vec) + acquire_io_buf() give the hot path a single uniform buffer type. The bounded async-backpressure path is unchanged when the pool exists.
- New GLOBAL_HEAP_FALLBACKS counter + glidefs_ublk_buffer_pool_heap_fallbacks_total metric so a worker stuck on heap buffers (sustained degradation) is alertable.
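The "cache success, retry failure, log only on transitions" behavior described above might look roughly like the sketch below. It is not the code from this PR: `Pool`, `WorkerState`, the `try_mmap` closure parameter, and the in-memory `log_lines` vector are hypothetical stand-ins so the example runs without a real mmap or logger, and emitting one line on entering the degraded state plus one on recovery is an assumed reading of the throttling.

```rust
use std::io;

/// Hypothetical stand-in for the mmap-backed pool.
struct Pool {
    slots: usize,
}

/// Hypothetical per-worker state: caches a successful pool, never a failure.
struct WorkerState {
    pool: Option<Pool>,
    degraded: bool,         // true while the most recent mmap attempt failed
    log_lines: Vec<String>, // stand-in for the real logger
}

impl WorkerState {
    fn new() -> Self {
        WorkerState { pool: None, degraded: false, log_lines: Vec::new() }
    }

    /// Returns the pool if one exists. While absent, every call retries
    /// `try_mmap`, so the worker upgrades as soon as an attempt succeeds.
    fn worker_pool<F>(&mut self, try_mmap: F) -> Option<&Pool>
    where
        F: FnOnce() -> io::Result<Pool>,
    {
        if self.pool.is_none() {
            match try_mmap() {
                Ok(pool) => {
                    if self.degraded {
                        // Log once when leaving the degraded state.
                        self.log_lines.push("buffer pool recovered".to_string());
                        self.degraded = false;
                    }
                    self.pool = Some(pool); // success is cached
                }
                Err(err) => {
                    if !self.degraded {
                        // Log once when entering the degraded state,
                        // not on every starved I/O.
                        self.log_lines
                            .push(format!("buffer pool mmap failed, using heap: {err}"));
                        self.degraded = true;
                    }
                    // Failure is not cached: self.pool stays None.
                }
            }
        }
        self.pool.as_ref()
    }
}
```

Calling `worker_pool` repeatedly with a failing closure returns `None` each time but records a single log line; the first successful call records the recovery line and later calls no longer invoke the closure.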

Behavior change

Condition	Before	After
Pool mmap succeeds	bounded pool	bounded pool (unchanged)
Pool momentarily exhausted	async backpressure park	async backpressure park (unchanged)
Pool mmap fails (host OOM at init)	panic → abort → crash loop	heap fallback, daemon stays up, auto-recovers, metric increments
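The three rows of the table can be modeled with a small synchronous sketch. Only the names `IoBuf`, `acquire_io_buf`, and `GLOBAL_HEAP_FALLBACKS` come from the PR text; their shapes here are assumptions. In particular, `SlotPool`, `PooledSlot`, and the `Acquire::Park` variant are hypothetical, and the real path parks asynchronously rather than returning an enum.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts buffers served from the heap because no pool was available.
/// (The real counter's type is an assumption.)
static GLOBAL_HEAP_FALLBACKS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical pooled slot: an index into the worker's mmapped region.
struct PooledSlot {
    index: usize,
}

/// Uniform buffer type for the hot path: pooled slot or heap vec.
enum IoBuf {
    Pooled(PooledSlot),
    Heap(Vec<u8>),
}

/// Hypothetical pool holding a free list of slot indices.
struct SlotPool {
    free: Vec<usize>,
}

/// Outcome of an acquire attempt in this simplified model.
enum Acquire {
    Ready(IoBuf),
    /// Pool exists but is momentarily exhausted: the caller should park.
    Park,
}

fn acquire_io_buf(pool: Option<&mut SlotPool>, len: usize) -> Acquire {
    match pool {
        // Rows 1 and 2: the pool exists, so behavior stays bounded.
        Some(p) => match p.free.pop() {
            Some(index) => Acquire::Ready(IoBuf::Pooled(PooledSlot { index })),
            None => Acquire::Park,
        },
        // Row 3: no pool (mmap failed), so fall back to the heap and count it.
        None => {
            GLOBAL_HEAP_FALLBACKS.fetch_add(1, Ordering::Relaxed);
            Acquire::Ready(IoBuf::Heap(vec![0u8; len]))
        }
    }
}
```

Keeping exhaustion (`Park`) distinct from absence (heap fallback) is what lets the bounded backpressure path stay unchanged whenever a pool exists.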

Test plan

cargo build --release -p glidefs --features ublk
buffer_pool unit tests (--features ublk)
Deployed to the homelab node via graceful handoff (systemctl reload)

🤖 Generated with Claude Code

…rting

The per-worker USER_COPY bounce pool mmapped + pre-faulted ~32 MB at worker init and `.expect()`-panicked on failure. With 16 workers and many exports recovering at once, a transient host memory spike at startup made one mmap return ENOMEM; the panic hook fired, then even tiny heap allocations failed and the Rust allocator called abort() (SIGABRT). systemd's on-failure restart re-slammed the same 512 MB burst into a tight host every 10s, re-triggering the OOM until it hit the start-limit and gave up, a self-amplifying crash loop that only escaped when host pressure happened to ease.

The pool is a performance optimization (bounded bounce RSS), so the correct degraded behavior when it can't be mmapped is a heap buffer, not a dead daemon:

- `worker_pool()` now returns `Option` and never panics. Success is cached; failure is deliberately NOT cached, so a later I/O retries the mmap and the worker upgrades back to the bounded fast path on its own once the host recovers. Logging is throttled to one line per degrade→recover transition (not per I/O while starved).
- New `IoBuf` enum (pooled slot or heap vec) and `acquire_io_buf()` give the hot path a single uniform buffer type. `io_task_user_copy` acquires through it; the bounded backpressure path is unchanged when the pool exists.
- New `GLOBAL_HEAP_FALLBACKS` counter + `glidefs_ublk_buffer_pool_heap_fallbacks_total` metric so sustained degradation (a worker stuck on heap buffers) is alertable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jaredLunde merged commit 078c534 into main Jun 5, 2026
24 checks passed

jaredLunde deleted the fix/buffer-pool-oom-graceful-degrade branch June 5, 2026 07:11
