Fix RandomAccessReader leak in OnDiskGraphIndexCompactor#13
Merged
Conversation
compactLevels() ran batch tasks on a (possibly caller-owned) executor whose worker threads each lazily created a Scratch via a ThreadLocal. Each Scratch holds one GraphSearcher — hence one RandomAccessReader — per source. Only the calling thread's Scratch was closed, so every worker thread's readers leaked for the lifetime of the executor. With a remote-backed reader (HerdDB's RemoteRandomAccessReader) each leaked reader pins an off-heap block buffer; over many compaction cycles this exhausts the memory budget and stalls ingestion (eolivelli/herddb#590). Track every Scratch created on any thread in a ConcurrentLinkedQueue and close them all in a finally block. resolveEntryNodeSource() had a second, smaller leak — getView() opened a View that was never closed — now fixed with try-with-resources. Add testCompactionClosesAllReaders, which wraps each source's ReaderSupplier so it records whether every vended RandomAccessReader was closed, runs a compaction on a caller-owned multi-threaded pool, and asserts no reader is left open once compact() returns. The test fails (5 readers left open) without the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses pr-reviewer findings on the compaction reader-leak fix: - compactLevels() now tracks every submitted batch Future and, in its finally block, cancels and awaits all of them (awaitAllTasks) BEFORE draining the Scratch registry. Previously, when compaction failed, runBatchesWithBackpressure propagated immediately while other tasks were still in flight; such a task could run the ThreadLocal supplier and register a fresh Scratch after the drain had finished, leaking it. awaitAllTasks is a cheap no-op on the success path (all tasks already done) and also guarantees no task outlives compactLevels(). - The Scratch constructor now closes any GraphSearchers it already opened if a later source's getView() throws, so a partially built Scratch (never registered for the drain) does not leak readers. - closeAllScratch / closeAllScratchQuietly now also catch RuntimeException so one failing Scratch.close() cannot abort closing the rest. - Add testCompactionFailureClosesAllReaders: injects an IOException from a worker-thread vector read partway through compaction and asserts every vended RandomAccessReader is closed once compact() fails, covering the closeAllScratchQuietly path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses pr-reviewer iteration 2: Scratch.close() did `for (var s : gs) s.close()`, so a single GraphSearcher.close() (reader close()) throwing would leak the readers behind every remaining searcher in the same Scratch — a residual instance of the leak class this PR exists to fix. Scratch.close() now closes every GraphSearcher best-effort: each close() is wrapped in its own try/catch accumulating a firstError (later failures attached as suppressed), the first failure is rethrown after the loop, and null slots left by a partially-failed constructor are skipped. Also add testCompactionSurvivesReaderCloseFailure, which makes one source's worker-thread readers throw from close() and asserts every other source's readers are still closed after compact(). The test fails (4 readers never closed) against the previous Scratch.close(). NIT: the level-0 and upper-layer onComplete lambdas now wrap their IOException in UncheckedIOException instead of a bare RuntimeException. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
OnDiskGraphIndexCompactor.compactLevels()leaks aRandomAccessReaderfor every source, on every executor worker thread, for every compaction.
compactLevels()runs its batch tasks onexecutor(which may becaller-owned). Each worker thread lazily creates its own
Scratchvia aThreadLocal, and everyScratchholds oneGraphSearcher— hence oneRandomAccessReader(viaGraphIndex.View) — per source. At the end,only the calling thread's
Scratchis closed:Every
Scratchcreated on anexecutorworker thread is never closed,so its
GraphSearchers — and theRandomAccessReaders behind theirViews — leak for the lifetime of the executor (and accumulate acrosscompaction cycles when the executor is shared).
With an in-memory/mmap reader this is a benign FD/handle leak. With a
remote-backed reader (HerdDB's
RemoteRandomAccessReader) each leakedreader pins an off-heap Netty block buffer; over many compaction cycles
this exhausts the memory budget and stalls ingestion — see
eolivelli/herddb#590,
whose
SEVERE: LEAK: ByteBuf.release() was not calledtraces alloriginate in this code path.
A second, smaller leak:
resolveEntryNodeSource()calledsources.get(s).getView()and never closed the returnedView.Changes
OnDiskGraphIndexCompactor.compactLevels(): register everyScratchcreated on any thread in a
ConcurrentLinkedQueueand close them allin a
finallyblock. On normal completion a close failure issurfaced; while a primary exception is propagating, close failures are
logged so they cannot mask the original cause.
OnDiskGraphIndexCompactor.resolveEntryNodeSource(): wrap thegetView()call in try-with-resources.testCompactionClosesAllReaders: wraps each source'sReaderSupplierso every vendedRandomAccessReaderrecords whetherit was closed, runs a compaction on a caller-owned multi-threaded
ForkJoinPool, and asserts no reader is left open oncecompact()returns. The test fails (
5 still open) onmainand passes withthe fix.
Test plan
testCompactionClosesAllReaderspasses with the fix, fails without itTestOnDiskGraphIndexCompactorsuite (9 tests) passes🤖 Generated with Claude Code