issue #621: catch Error in compaction rebuild + cycle so Netty OOM does not leak orphan multipart blocks by eolivelli · Pull Request #624 · eolivelli/herddb

eolivelli · 2026-05-20T18:12:49Z

Fixes #621.

Symmetric compaction-side fix for the bug class that #616 / #618 fixed on the checkpoint side. Netty surfaces direct-buffer allocation failures as OutOfDirectMemoryError extends OutOfMemoryError extends Error. Several catch arms on the compaction code path were narrower than the Error class hierarchy, so a Netty OOM mid multipart-upload during compaction:

Bypassed the orphan-list wrap into CompactionException.
The outer runCompactionCycle catch arms never saw the orphans → partial multipart blocks leaked in remote storage indefinitely.
The Error propagated up to vectorIndexCompactionLoop's catch (RuntimeException) safety net — also too narrow — and killed the compaction daemon thread, stopping compaction until IS restart.

Changes

Production-only (no new tests per direction):

VectorIndexCompactor.rebuildSegmentStreaming — broaden four catch arms to include Error:
- OnDiskGraphIndexCompactor constructor + compact() (line ~1301)
- writeStreamingCompactedMapFile (line ~1315)
- writeStreamingCompactedSegment (line ~1325) — the main case from the issue: a Netty OOM mid multipart-write now produces the orphan-list wrap that runCompactionCycle's catch (CompactionException) arm drains into pendingDeletes
- preloadCompactedSegment (line ~1367) — post-upload, with orphans known
VectorIndexCompactor.rebuildSegmentLegacy — broaden the matching three catch arms (builder.cleanup, writeSyntheticShard, preloadCompactedSegment) for symmetry with the streaming path.
PersistentVectorStore.vectorIndexCompactionLoop — broaden the safety-net catch from RuntimeException to RuntimeException | Error. This is the daemon-thread lifecycle catch (analogous to the outer Phase B broadening in issue #616: Phase B checkpoint guard — pre-track multipart files before upload #618). The catch already logs SEVERE, bumps compactionConsecutiveFailures, and lets the loop's computeBackoffMs back-off keep a persistent Error source from spinning the thread. We do NOT rethrow Error here — it would end the only retry path the compaction subsystem has.

Each broadened arm carries an inline comment per CLAUDE.md guidance explaining why the broad Error catch is required. No Error is ever silently swallowed: every arm either rethrows wrapped as CompactionException (per-step catches) or rethrows nothing while logging SEVERE + bumping the failure counter (the daemon safety net).

Tests

Skipped per user direction. The regression batch below was run to confirm nothing existing broke.

Test plan

mvn -pl herddb-indexing-service -Dtest='Issue354TieredCompactionTest,Issue285SegmentCountCompactionTest,Issue570MicroSegmentCompactionTest,Issue587CompactionInputCapTest,EagerDownloadCompactionTest,Issue535DropSegmentByUuidRaceTest,Issue256CompactionReleaseTest,CompactionDeprecateInputsFailureSafetyTest,Issue616PhaseBPartialUploadRollbackTest,RemoteSegmentGraphMergerStreamingTest,RemoteSegmentGraphMergerUploadFailureTest,RemoteSegmentGraphMergerCorruptInputTest' test — 49 tests across the existing compaction + Phase B regression set, all pass.
mvn -B spotless:apply -DskipTests then mvn -B spotless:check apache-rat:check spotbugs:check install -DskipTests -Pci — green.

🤖 Implemented by the pr-worker agent.

…es not leak orphan multipart blocks Symmetric compaction-side fix for the bug class that issue #616 / #618 fixed on the checkpoint side. Netty surfaces direct-buffer allocation failures as OutOfDirectMemoryError extends OutOfMemoryError extends Error. Several catch arms on the compaction code path were narrower than the Error class hierarchy, so a Netty OOM mid multipart-upload during compaction bypassed the orphan-list wrap into CompactionException -> runCompactionCycle's catch arm never saw the orphans -> partial blocks leaked in remote storage. The same Error escape also killed the compaction daemon thread (the safety net was catch (RuntimeException)), stopping compaction until IS restart. Production-only changes — no new tests per direction: * VectorIndexCompactor.rebuildSegmentStreaming: broaden 4 catch arms to include Error (OnDiskGraphIndexCompactor ctor/compact, map-file build, writeStreamingCompactedSegment, preloadCompactedSegment). Each still wraps the failure as a CompactionException so the outer cycle's orphan-list pipeline runs. The orphan list is the SAME under Error as under IOException — partial multipart blocks are queued into pendingDeletes for retention-aware cleanup. * VectorIndexCompactor.rebuildSegmentLegacy: broaden the matching 3 catch arms (builder.cleanup, writeSyntheticShard, preloadCompactedSegment) for symmetry with the streaming path. * PersistentVectorStore.vectorIndexCompactionLoop: broaden the safety-net catch from RuntimeException to RuntimeException | Error. This is the daemon-thread lifecycle catch — analogous to the outer Phase B broadening in #618. The catch already logs SEVERE, bumps compactionConsecutiveFailures, and lets the loop's computeBackoffMs back-off keep a persistent Error source from spinning the thread. We do NOT rethrow Error: it would end the only retry path the compaction subsystem has. Each broadened arm carries an inline comment per CLAUDE.md guidance explaining why the broad Error catch is required. No Error is ever silently swallowed — every arm either rethrows wrapped as CompactionException (for the per-step catches) or rethrows nothing while logging SEVERE + bumping the failure counter (for the daemon safety net). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… symmetric Error metrics buckets Addresses the pr-reviewer findings on PR #624 (issue #621): 1. (BLOCK) PersistentVectorStore.runCompactionCycle: add a third catch arm `catch (Error e)` AFTER catch (DataStorageManagerException). The per-step Error wraps in VectorIndexCompactor's rebuild paths close the leak when the failure surfaces INSIDE the rebuild, but the post-rebuild section (lateDeletes.forEach, warmUpNewSegmentsBeforePublish, atomicSwapCompactionResult) runs AFTER rebuildSegment returns and is one frame above the patched arms. A Netty OutOfDirectMemoryError from warmUpSegmentBfs's SegmentBlockCache reads or from stageNewSegments's ZK / publisher boundary would bypass the existing catch (CompactionException) arm and leak the already-uploaded merged-output multipart blocks. The new arm drains rebuild.orphanPaths into pendingDeletes (mirroring the CompactionException arm), bumps the WRITE_IO per-reason counter, logs SEVERE, then rethrows so the daemon safety net handles the consecutive-failures bookkeeping uniformly with other Error escapes. We rethrow rather than swallow because the JVM may be in a degraded state where the next cycle is not safe to start without the loop's back-off pause. 2. (BLOCK) PersistentVectorStore.atomicSwapCompactionResult: broaden catch (RuntimeException stageFailed) to RuntimeException | Error and pass the original throwable as the CompactionException cause so the already-uploaded merged-output multipart files still drive queueMergedOutputForDeletion when stageNewSegments throws an Error. 3. (REQUEST_CHANGES) VectorIndexCompactor close-on-failure catches: broaden catch (RuntimeException) / catch (IOException) at the merged-segment / builder close()s in rebuildSegmentStreaming and rebuildSegmentLegacy to also catch Error. Without this, an Error from close() can suppress an in-flight CompactionException carrying the orphan list, after which runCompactionCycle's catch arm never sees the orphans and partial multipart files leak. 4. (REQUEST_CHANGES, metrics) VectorIndexCompactor.rebuildSegmentStreaming download-stage catch and populateSyntheticShard catches: broaden to include Error so failures are bucketed by FailureReason (READ_IO / CORRUPTION / WRITE_IO) instead of being attributed by the daemon safety net to "unexpected compaction failure" with no per-reason metric. Same broadening on the view.close() in the populateSyntheticShard finally so a close-time Error does not suppress an in-flight CompactionException. 5. (NIT) Qualify the inline comment around the streaming "orphans" list prepopulation to make explicit that the orphan-routing guarantee depends on every downstream catch staying Error-aware — pointing future contributors at both the per-step wraps in this file and the post-rebuild catch (Error) arm in runCompactionCycle. No new tests per direction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ble counter bump + remaining Error-suppressing close Addresses the pr-reviewer's iteration-2 BLOCK findings on PR #624: 1. (BLOCK) PersistentVectorStore.runCompactionCycle: the iteration-1 catch (Error) arm drained rebuild.orphanPaths unconditionally, which would silently delete the LIVE merged segment's files if an Error fires AFTER atomicSwapCompactionResult publishes (commitStagedSegments, deprecateInputs, post-publish best-effort work). Add a `boolean published` flag flipped to true the instant atomicSwapCompactionResult returns successfully; the orphan drain in the new catch (Error) arm now only runs when !published. When published, the merged segment is live in this.segments and the files MUST NOT be queued for retention deletion. The post-publish Error path then matches the pre-#624 behaviour: rethrow → daemon safety net logs SEVERE + bumps the counter; ZK reconcile-on-restart heals any registry hiccup; the published segment stays intact. 2. (BLOCK) PersistentVectorStore.runCompactionCycle catch (Error) arm: drop the recordCompactionFailure(WRITE_IO) call so the daemon safety net at vectorIndexCompactionLoop:2416 is the SINGLE place that bumps compactionConsecutiveFailures + logs SEVERE for Error escapes. Pre-fix, both fired and the loop's computeBackoffMs saw double the consecutive failures on every Error. 3. (BLOCK) PersistentVectorStore.atomicSwapCompactionResult: broaden catch (RuntimeException) to RuntimeException | Error on BOTH commitStagedSegments and deprecateInputs (the two post-publish best-effort ZK calls). Pre-fix an Error from either would escape atomicSwapCompactionResult AFTER publish; the caller's catch (Error) arm would then misinterpret the live segment's orphan paths as deletable. With these catches Error-aware, the post-publish recovery shape is identical to the existing RuntimeException case (log WARNING, reconcile-on-restart heals). 4. (BLOCK) VectorIndexCompactor.rebuildSegmentStreaming outermost finally: broaden the ReaderSupplier.close() catch from IOException | RuntimeException to + Error. This finally runs AFTER the inner mergedSegment.close() finally on both success and failure paths; on the failure path an Error from a source-graph supplier close (memory-mapped Arena cleanup, native cleanup hook) would suppress an in-flight CompactionException carrying the orphan list — the same close-suppression pattern the iteration-1 close arms already addressed in the inner finallys. No new tests per direction. 50-test regression batch passes; pre-PR validation green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pCompactionResult Addresses the pr-reviewer iteration-3 BLOCK on PR #624: the published-flag gate in runCompactionCycle was correct, but two post-publish code paths inside atomicSwapCompactionResult could still leak an Error back to the caller with `published == false`, where the caller's catch (Error) arm would erroneously drain rebuild.orphanPaths for the already-live merged segment — same silent-data-loss shape. * in.close() loop at line 3361 — runs OUTSIDE the writeLock immediately after the volatile publish (this.segments = ...). Closes the compacted input segments, which can in principle raise an Error from BLink page release or OnDiskGraph mmap Arena cleanup. The existing catch (RuntimeException) is documented as a broad-by-necessity catch ("must not let the swap fail afterwards"); broaden it to RuntimeException | Error for the same reason. * queueSegmentPendingDelete loop at line 3507 — runs even later, after the post-publish ZK coordination. Wrap the for-loop in try { ... } catch (RuntimeException | Error e) { LOGGER.log(WARNING, ...); }. The input segments stay in MinIO as orphans-but-intact if the queue fails (the optimizer's orphan-ACTIVE fold or a manual reaper reclaims them on the next pass); we MUST NOT let the Error escape to runCompactionCycle where it would interpret the live merged segment's files as deletable orphans. 50-test indexing-service regression batch passes; pre-PR validation green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eolivelli and others added 4 commits May 20, 2026 20:12

eolivelli merged commit d21efda into master May 20, 2026
8 checks passed

eolivelli deleted the issue-621-is-compaction-error-handling-catch-throw branch May 20, 2026 19:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

issue #621: catch Error in compaction rebuild + cycle so Netty OOM does not leak orphan multipart blocks#624

issue #621: catch Error in compaction rebuild + cycle so Netty OOM does not leak orphan multipart blocks#624
eolivelli merged 4 commits into
masterfrom
issue-621-is-compaction-error-handling-catch-throw

eolivelli commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eolivelli commented May 20, 2026

Changes

Tests

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant