Parallelize PQRetrainer training-vector extraction across sources #12
Merged
PQRetrainer.extractVectorsSequential read every training sample with a single-threaded, blocking getVectorInto() loop. Against remote storage each read is a network round-trip with no OS read-ahead, so thousands of samples serialize into thousands of round-trips. This was observed as a 2+ hour stall during a 53-segment compaction (HerdDB issue datastax#587).

Extract each source on its own thread and View (one RandomAccessReader per View, never shared) so that up to jvector.pq.retrain.io.threads (default 16) remote reads are in flight at once; within a source, reads stay in ascending order for read-ahead friendliness. Also close the previously leaked Views and emit periodic progress logs so that a slow extraction is distinguishable from a hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
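The fan-out described above can be sketched roughly as follows. This is an illustration of the pattern, not jvector's actual code: `ParallelExtractionSketch`, `VectorSource`, and `SourceView` are hypothetical stand-ins for a source index and its `View`, and the method shapes are assumptions. Only the property name `jvector.pq.retrain.io.threads` and its default of 16 come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelExtractionSketch {

    /** Hypothetical stand-in for a View: owns its own reader, used by one thread only. */
    interface SourceView extends AutoCloseable {
        void getVectorInto(int ordinal, float[] dest);

        @Override
        void close(); // narrowed so callers need no checked-exception handling
    }

    /** Hypothetical stand-in for one source index. */
    interface VectorSource {
        SourceView openView();

        int dimension();
    }

    /** Thread count from the property named in the PR; 16 when unset. */
    static int ioThreads() {
        return Math.max(1, Integer.getInteger("jvector.pq.retrain.io.threads", 16));
    }

    /** One task per source; sampledOrdinals.get(i) lists the ordinals to read from sources.get(i). */
    static List<float[][]> extractAll(List<VectorSource> sources, List<int[]> sampledOrdinals) {
        if (sources.isEmpty()) {
            return new ArrayList<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(ioThreads(), sources.size()));
        try {
            List<Future<float[][]>> futures = new ArrayList<>();
            for (int s = 0; s < sources.size(); s++) {
                VectorSource source = sources.get(s);
                int[] ordinals = sampledOrdinals.get(s).clone();
                Arrays.sort(ordinals); // ascending reads within a source
                Callable<float[][]> task = () -> extractOne(source, ordinals);
                futures.add(pool.submit(task));
            }
            List<float[][]> result = new ArrayList<>();
            for (Future<float[][]> f : futures) {
                result.add(f.get()); // collected in source order
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting training vectors", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("training-vector extraction failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    /** Reads one source through a view private to this task; the view is always closed. */
    static float[][] extractOne(VectorSource source, int[] ordinals) {
        float[][] out = new float[ordinals.length][source.dimension()];
        try (SourceView view = source.openView()) {
            for (int i = 0; i < ordinals.length; i++) {
                view.getVectorInto(ordinals[i], out[i]);
            }
        }
        return out;
    }
}
```

Opening the view inside the task, in a try-with-resources block, keeps each reader confined to one thread and guarantees it is closed even when a read fails.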
eolivelli added a commit to eolivelli/herddb that referenced this pull request on May 18, 2026
Fixes #587.

A vector-index compaction cycle that selected **53 input segments** stalled the Indexing Service for 2+ hours. The downstream PQ-retraining step (jvector `PQRetrainer.extractVectorsSequential`) samples training vectors per input segment with random remote-storage reads, so its I/O cost scales with the number of input segments. `VectorIndexCompactor.chooseSegmentsToMerge` bounded the picked set only by a byte cap, which never bites when many segments are individually small, leaving the input count effectively unbounded.

This PR is the **HerdDB-side mitigation**. The complementary jvector-side fix (parallelizing the per-source extraction so the per-read latency is hidden) is in **eolivelli/jvector#12**. The two changes are independent (HerdDB CI builds jvector from `eolivelli/jvector` `main` and is unaffected by the jvector PR's merge state), but full latency relief needs both.

## Changes

- `VectorIndexCompactor`: new 7-arg `chooseSegmentsToMerge` overload with a `maxInputs` parameter (6-arg overload delegates with the cap disabled). After the fire/no-fire trigger decision, the normal byte-capped selection is truncated smallest-first to at most `maxInputs` segments, with an INFO log when truncation happens. The micro-segment fast path (#570) is **deliberately exempt**: those cycles must stay fast slot-reclaiming merges and the PQ-retraining-I/O concern does not apply to them. Added `clampMaxInputs` (`<=0` disables, `1`→`2`) and `computeTieredMaxInputs`.
- `PersistentVectorStore`: `DEFAULT_VECTOR_INDEX_COMPACTION_MAX_INPUTS = 16`, a `vectorIndexCompactionMaxInputs` field, `setCompactionMaxInputs`/`getCompactionMaxInputs`. The base cap is **tier-scaled** (2×/4×/8× at 100/300/500 segments) per cycle alongside the byte/count caps, so the per-cycle drain rate rises with the backlog and the cap cannot starve the tailer toward the back-pressure threshold. The cycle still fires on the same triggers and merges leftover segments in subsequent cycles.
- `IndexingServerConfiguration` / `IndexingServiceEngine`: new `vector.index.compaction.maxInputs` config key (default 16), wired into the store and the startup config log.

## Tests

- `VectorIndexCompactorChooseTest`, new cases: a 53-segment pick is truncated to the 16 smallest in order; `maxInputs=0` disables the cap; the cap never changes the fire/no-fire trigger decision; a picked set within the cap is returned untruncated; the micro-segment fast-path result is **not** capped; `clampMaxInputs` normalisation.
- `Issue587CompactionInputCapTest`, new end-to-end test: builds a 50-segment backlog with the cap enabled at its default, drives multiple compaction cycles, and asserts every cycle merges at most the cap and the segment count strictly converges (no starvation).
- `Issue354TieredCompactionTest`: new `computeTieredMaxInputs` unit tests (scaling, overflow, disabled-cap); the two end-to-end tiered tests disable the orthogonal input cap so their "drain the whole backlog in one cycle" premise still holds.
- Pre-PR validation green: `spotless:check apache-rat:check install -DskipTests spotbugs:check -Pci` (the exact CI gate).
- Hammer suite green (twice): `DirectMultipleConcurrentUpdatesSuite{NoIndexes,WithNonUniqueIndexes,WithUniqueIndexes}Test`, `BLinkConcurrentSearchInsertTest`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
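The input-cap rules listed under Changes in the commit message above can be illustrated with a small sketch. The method names `clampMaxInputs` and `computeTieredMaxInputs` come from the commit message, but their signatures and bodies here are guesses. In particular, this sketch assumes that `0` encodes "cap disabled", that the tier thresholds are inclusive, and that an overflowing product saturates at `Integer.MAX_VALUE`. `CompactionInputCapSketch` and `truncateSmallestFirst` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToLongFunction;

final class CompactionInputCapSketch {

    /** Normalises a configured cap: <=0 disables (returned as 0 here, an assumption), 1 becomes 2. */
    static int clampMaxInputs(int configured) {
        if (configured <= 0) {
            return 0;
        }
        return Math.max(2, configured);
    }

    /** Scales the base cap 2x/4x/8x once the backlog reaches 100/300/500 segments. */
    static int computeTieredMaxInputs(int baseCap, int segmentCount) {
        int cap = clampMaxInputs(baseCap);
        if (cap == 0) {
            return 0; // a disabled cap stays disabled at every tier
        }
        long multiplier = segmentCount >= 500 ? 8 : segmentCount >= 300 ? 4 : segmentCount >= 100 ? 2 : 1;
        // The multiplication is done in long so it cannot wrap; the result saturates (assumed behaviour).
        return (int) Math.min(Integer.MAX_VALUE, cap * multiplier);
    }

    /** Keeps at most maxInputs items, smallest first; returns the input unchanged when within the cap. */
    static <T> List<T> truncateSmallestFirst(List<T> picked, ToLongFunction<T> sizeOf, int maxInputs) {
        int cap = clampMaxInputs(maxInputs);
        if (cap == 0 || picked.size() <= cap) {
            return picked;
        }
        List<T> sorted = new ArrayList<>(picked);
        sorted.sort(Comparator.comparingLong(sizeOf));
        return new ArrayList<>(sorted.subList(0, cap));
    }
}
```

With these assumptions, a base cap of 16 yields 16, 32, 64, and 128 inputs per cycle at backlogs of 50, 100, 300, and 500 segments, so the drain rate grows with the backlog.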
## Summary

- `PQRetrainer.extractVectorsSequential` extracted every PQ training sample with a single-threaded, blocking `getVectorInto()` loop. Against remote-backed graph storage each read is a network round-trip with no OS read-ahead, so thousands of samples serialize into thousands of sequential round-trips.
- Each source is now extracted on its own thread through its own `OnDiskGraphIndex.View` (one `RandomAccessReader` per `View`, never shared), so up to `jvector.pq.retrain.io.threads` (default 16) remote reads are in flight at once. Reads within a source stay ascending, preserving read-ahead friendliness.
- Closes the `View`s that the old code leaked, and emits periodic progress logs so a slow extraction can be distinguished from a hang.

## Tests

- `TestOnDiskGraphIndexCompactor#testCompactManySourcesParallelRetrain` compacts 8 FusedPQ sources (exercising the parallel path) and asserts every source's inline vectors survive compaction exactly at their remapped ordinals.
- The `TestOnDiskGraphIndexCompactor` suite passes (8 tests).

🤖 Generated with Claude Code
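The periodic progress logging mentioned in the summary could look something like the sketch below: a counter shared by all worker threads that emits at most one message per interval. Everything here is hypothetical, including the class name `ProgressLogSketch`, the message text, and the injected clock and sink (used so the behaviour can be checked without a real logger); it is not jvector's actual log format.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

final class ProgressLogSketch {
    private final long total;
    private final long intervalNanos;
    private final LongSupplier nanoClock;
    private final Consumer<String> sink;
    private final AtomicLong done = new AtomicLong();
    private final AtomicLong lastLogNanos;

    ProgressLogSketch(long total, long intervalNanos, LongSupplier nanoClock, Consumer<String> sink) {
        this.total = total;
        this.intervalNanos = intervalNanos;
        this.nanoClock = nanoClock;
        this.sink = sink;
        this.lastLogNanos = new AtomicLong(nanoClock.getAsLong());
    }

    /** Called by any worker after each vector read; safe to call concurrently. */
    void recordOne() {
        long completed = done.incrementAndGet();
        long now = nanoClock.getAsLong();
        long last = lastLogNanos.get();
        // Only the thread that wins the compare-and-set logs, so each interval yields one message.
        if (now - last >= intervalNanos && lastLogNanos.compareAndSet(last, now)) {
            sink.accept("PQ retrain extraction: " + completed + "/" + total + " vectors read");
        }
    }
}
```

In production use the clock would typically be `System::nanoTime` and the sink a logger call; a steadily advancing count tells an operator that a slow extraction is still making progress.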