Skip to content

Fix RandomAccessReader leak in OnDiskGraphIndexCompactor#13

Merged
eolivelli merged 3 commits into
mainfrom
fix-compactor-scratch-reader-leak
May 18, 2026
Merged

Fix RandomAccessReader leak in OnDiskGraphIndexCompactor#13
eolivelli merged 3 commits into
mainfrom
fix-compactor-scratch-reader-leak

Conversation

@eolivelli
Copy link
Copy Markdown
Owner

Summary

OnDiskGraphIndexCompactor.compactLevels() leaks a RandomAccessReader
for every source, on every executor worker thread, for every compaction.

compactLevels() runs its batch tasks on executor (which may be
caller-owned). Each worker thread lazily creates its own Scratch via a
ThreadLocal, and every Scratch holds one GraphSearcher — hence one
RandomAccessReader (via GraphIndex.View) — per source. At the end,
only the calling thread's Scratch is closed:

Scratch s = threadLocalScratch.get();
s.close();
threadLocalScratch.remove();

Every Scratch created on an executor worker thread is never closed,
so its GraphSearchers — and the RandomAccessReaders behind their
Views — leak for the lifetime of the executor (and accumulate across
compaction cycles when the executor is shared).

With an in-memory/mmap reader this is a benign FD/handle leak. With a
remote-backed reader (HerdDB's RemoteRandomAccessReader) each leaked
reader pins an off-heap Netty block buffer; over many compaction cycles
this exhausts the memory budget and stalls ingestion — see
eolivelli/herddb#590,
whose SEVERE: LEAK: ByteBuf.release() was not called traces all
originate in this code path.

A second, smaller leak: resolveEntryNodeSource() called
sources.get(s).getView() and never closed the returned View.

Changes

  • OnDiskGraphIndexCompactor.compactLevels(): register every Scratch
    created on any thread in a ConcurrentLinkedQueue and close them all
    in a finally block. On normal completion a close failure is
    surfaced; while a primary exception is propagating, close failures are
    logged so they cannot mask the original cause.
  • OnDiskGraphIndexCompactor.resolveEntryNodeSource(): wrap the
    getView() call in try-with-resources.
  • New testCompactionClosesAllReaders: wraps each source's
    ReaderSupplier so every vended RandomAccessReader records whether
    it was closed, runs a compaction on a caller-owned multi-threaded
    ForkJoinPool, and asserts no reader is left open once compact()
    returns. The test fails (5 still open) on main and passes with
    the fix.

Test plan

  • testCompactionClosesAllReaders passes with the fix, fails without it
  • Full TestOnDiskGraphIndexCompactor suite (9 tests) passes

🤖 Generated with Claude Code

eolivelli and others added 3 commits May 18, 2026 19:11
compactLevels() ran batch tasks on a (possibly caller-owned) executor
whose worker threads each lazily created a Scratch via a ThreadLocal.
Each Scratch holds one GraphSearcher — hence one RandomAccessReader —
per source. Only the calling thread's Scratch was closed, so every
worker thread's readers leaked for the lifetime of the executor.

With a remote-backed reader (HerdDB's RemoteRandomAccessReader) each
leaked reader pins an off-heap block buffer; over many compaction
cycles this exhausts the memory budget and stalls ingestion
(eolivelli/herddb#590).

Track every Scratch created on any thread in a ConcurrentLinkedQueue
and close them all in a finally block. resolveEntryNodeSource() had a
second, smaller leak — getView() opened a View that was never closed —
now fixed with try-with-resources.

Add testCompactionClosesAllReaders, which wraps each source's
ReaderSupplier so it records whether every vended RandomAccessReader
was closed, runs a compaction on a caller-owned multi-threaded pool,
and asserts no reader is left open once compact() returns. The test
fails (5 readers left open) without the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses pr-reviewer findings on the compaction reader-leak fix:

- compactLevels() now tracks every submitted batch Future and, in its
  finally block, cancels and awaits all of them (awaitAllTasks) BEFORE
  draining the Scratch registry. Previously, when compaction failed,
  runBatchesWithBackpressure propagated immediately while other tasks
  were still in flight; such a task could run the ThreadLocal supplier
  and register a fresh Scratch after the drain had finished, leaking it.
  awaitAllTasks is a cheap no-op on the success path (all tasks already
  done) and also guarantees no task outlives compactLevels().

- The Scratch constructor now closes any GraphSearchers it already
  opened if a later source's getView() throws, so a partially built
  Scratch (never registered for the drain) does not leak readers.

- closeAllScratch / closeAllScratchQuietly now also catch RuntimeException
  so one failing Scratch.close() cannot abort closing the rest.

- Add testCompactionFailureClosesAllReaders: injects an IOException from
  a worker-thread vector read partway through compaction and asserts
  every vended RandomAccessReader is closed once compact() fails,
  covering the closeAllScratchQuietly path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses pr-reviewer iteration 2: Scratch.close() did
`for (var s : gs) s.close()`, so a single GraphSearcher.close()
(reader close()) throwing would leak the readers behind every
remaining searcher in the same Scratch — a residual instance of the
leak class this PR exists to fix.

Scratch.close() now closes every GraphSearcher best-effort: each
close() is wrapped in its own try/catch accumulating a firstError
(later failures attached as suppressed), the first failure is rethrown
after the loop, and null slots left by a partially-failed constructor
are skipped.

Also add testCompactionSurvivesReaderCloseFailure, which makes one
source's worker-thread readers throw from close() and asserts every
other source's readers are still closed after compact(). The test
fails (4 readers never closed) against the previous Scratch.close().

NIT: the level-0 and upper-layer onComplete lambdas now wrap their
IOException in UncheckedIOException instead of a bare RuntimeException.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eolivelli eolivelli merged commit 6fcd00d into main May 18, 2026
2 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant