Parallelize PQRetrainer training-vector extraction across sources by eolivelli · Pull Request #12 · eolivelli/jvector

eolivelli · 2026-05-18T12:58:37Z

Summary

PQRetrainer.extractVectorsSequential extracted every PQ training sample with a single-threaded, blocking getVectorInto() loop. Against remote-backed graph storage each read is a network round-trip with no OS read-ahead, so thousands of samples serialize into thousands of sequential round-trips.
This was observed downstream as a 2+ hour stall during a 53-segment compaction in HerdDB (herddb#587, "[k3s-bench] PQRetrainer.extractVectorsSequential: 2+ hour stall due to sequential random I/O over remote file server").
Each source is now extracted on its own worker thread using its own OnDiskGraphIndex.View (one RandomAccessReader per View — never shared), so up to jvector.pq.retrain.io.threads (default 16) remote reads are in flight at once. Reads within a source stay ascending, preserving read-ahead friendliness.
Also closes the Views that the old code leaked, and emits periodic progress logs so a slow extraction can be distinguished from a hang.
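
The per-source fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Source`, `SourceView`, `openView`, `readVector` and `extract` are hypothetical stand-ins for `OnDiskGraphIndex`, its `View` and `getVectorInto()`, and the pool sizing is an assumption. It shows the three properties the summary relies on: one view per worker (never shared), ascending reads within a source, and views closed when the task finishes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtractionSketch {
    /** Hypothetical stand-in for a View that owns its own reader. */
    public interface SourceView extends AutoCloseable {
        float[] readVector(int ordinal);
        @Override void close();
    }

    /** Hypothetical stand-in for one source index; each call opens a fresh view. */
    public interface Source {
        SourceView openView();
    }

    /** Extracts the sampled ordinals of every source, one worker task per source. */
    public static List<float[][]> extract(List<Source> sources, List<int[]> sampledOrdinals, int maxThreads)
            throws InterruptedException, ExecutionException {
        // Never start more threads than there are sources (assumed sizing rule).
        int threads = Math.max(1, Math.min(maxThreads, sources.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<float[][]>> futures = new ArrayList<>();
            for (int s = 0; s < sources.size(); s++) {
                Source source = sources.get(s);
                // Sort a copy so reads within a source stay ascending.
                int[] ordinals = sampledOrdinals.get(s).clone();
                Arrays.sort(ordinals);
                futures.add(pool.submit(() -> {
                    // The view is opened and closed inside the task, so it is never shared.
                    try (SourceView view = source.openView()) {
                        float[][] out = new float[ordinals.length][];
                        for (int i = 0; i < ordinals.length; i++) {
                            out[i] = view.readVector(ordinals[i]);
                        }
                        return out;
                    }
                }));
            }
            // Collect in source order; a failed read surfaces as ExecutionException.
            List<float[][]> result = new ArrayList<>();
            for (Future<float[][]> f : futures) {
                result.add(f.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller could derive `maxThreads` with something like `Integer.getInteger("jvector.pq.retrain.io.threads", 16)`, assuming the setting is exposed as a JVM system property; the summary gives only the name and the default.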

Tests

New TestOnDiskGraphIndexCompactor#testCompactManySourcesParallelRetrain compacts 8 FusedPQ sources (exercising the parallel path) and asserts every source's inline vectors survive compaction exactly at their remapped ordinals.
Full TestOnDiskGraphIndexCompactor suite passes (8 tests).

🤖 Generated with Claude Code

PQRetrainer.extractVectorsSequential read every training sample with a single-threaded blocking getVectorInto() loop. Against remote storage each read is a network round-trip with no OS read-ahead, so thousands of samples serialize into thousands of round-trips, observed as a 2+ hour stall during a 53-segment compaction (HerdDB issue #587).

Extract each source on its own thread/View (one RandomAccessReader per View, never shared) so up to jvector.pq.retrain.io.threads (default 16) remote reads are in flight at once; within a source reads stay ascending for read-ahead friendliness. Also close the previously-leaked Views and emit periodic progress logs so a slow extraction is distinguishable from a hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes #587.

A vector-index compaction cycle that selected **53 input segments** stalled the Indexing Service for 2+ hours. The downstream PQ-retraining step (jvector `PQRetrainer.extractVectorsSequential`) samples training vectors per input segment with random remote-storage reads, so its I/O cost scales with the number of input segments. `VectorIndexCompactor.chooseSegmentsToMerge` bounded the picked set only by a byte cap, which never bites when many segments are individually small, leaving the input count effectively unbounded.

This PR is the **HerdDB-side mitigation**. The complementary jvector-side fix (parallelizing the per-source extraction so the per-read latency is hidden) is in **eolivelli/jvector#12**. The two changes are independent (HerdDB CI builds jvector from `eolivelli/jvector` `main` and is unaffected by the jvector PR's merge state), but full latency relief needs both.

## Changes

- `VectorIndexCompactor`: new 7-arg `chooseSegmentsToMerge` overload with a `maxInputs` parameter (6-arg overload delegates with the cap disabled). After the fire/no-fire trigger decision, the normal byte-capped selection is truncated smallest-first to at most `maxInputs` segments, with an INFO log when truncation happens. The micro-segment fast path (#570) is **deliberately exempt**: those cycles must stay fast slot-reclaiming merges and the PQ-retraining-I/O concern does not apply to them. Added `clampMaxInputs` (`<=0` disables, `1`→`2`) and `computeTieredMaxInputs`.
- `PersistentVectorStore`: `DEFAULT_VECTOR_INDEX_COMPACTION_MAX_INPUTS = 16`, a `vectorIndexCompactionMaxInputs` field, `setCompactionMaxInputs`/`getCompactionMaxInputs`. The base cap is **tier-scaled** (2×/4×/8× at 100/300/500 segments) per cycle alongside the byte/count caps, so the per-cycle drain rate rises with the backlog and the cap cannot starve the tailer toward the back-pressure threshold. The cycle still fires on the same triggers and merges leftover segments in subsequent cycles.
- `IndexingServerConfiguration` / `IndexingServiceEngine`: new `vector.index.compaction.maxInputs` config key (default 16), wired into the store and the startup config log.

## Tests

- `VectorIndexCompactorChooseTest`, new cases: a 53-segment pick is truncated to the 16 smallest in order; `maxInputs=0` disables the cap; the cap never changes the fire/no-fire trigger decision; a picked set within the cap is returned untruncated; the micro-segment fast-path result is **not** capped; `clampMaxInputs` normalisation.
- `Issue587CompactionInputCapTest`, new end-to-end test: builds a 50-segment backlog with the cap enabled at its default, drives multiple compaction cycles, and asserts every cycle merges at most the cap and the segment count strictly converges (no starvation).
- `Issue354TieredCompactionTest`: new `computeTieredMaxInputs` unit tests (scaling, overflow, disabled-cap); the two end-to-end tiered tests disable the orthogonal input cap so their "drain the whole backlog in one cycle" premise still holds.
- Pre-PR validation green: `spotless:check apache-rat:check install -DskipTests spotbugs:check -Pci` (the exact CI gate).
- Hammer suite green (twice): `DirectMultipleConcurrentUpdatesSuite{NoIndexes,WithNonUniqueIndexes,WithUniqueIndexes}Test`, `BLinkConcurrentSearchInsertTest`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
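
The cap arithmetic described in the HerdDB change list can be sketched as below. This is an illustration under stated assumptions, not the HerdDB code: the method names come from the description, but the signatures, the use of `0` as the "disabled" value, the inclusive tier thresholds, the saturating overflow behaviour, and the modelling of segments as plain byte sizes are all guesses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InputCapSketch {
    /** Normalises the configured cap: <=0 disables (returned as 0, an assumption), 1 becomes 2. */
    public static int clampMaxInputs(int maxInputs) {
        if (maxInputs <= 0) {
            return 0; // cap disabled
        }
        return Math.max(2, maxInputs); // assumed rationale: a merge needs at least two inputs
    }

    /** Scales the base cap 2x/4x/8x at 100/300/500 segments (thresholds assumed inclusive). */
    public static int computeTieredMaxInputs(int baseCap, int segmentCount) {
        if (baseCap <= 0) {
            return 0; // a disabled cap stays disabled
        }
        long factor = segmentCount >= 500 ? 8 : segmentCount >= 300 ? 4 : segmentCount >= 100 ? 2 : 1;
        // Multiply in long and saturate so a large base cap cannot overflow int.
        return (int) Math.min(Integer.MAX_VALUE, baseCap * factor);
    }

    /** Keeps at most maxInputs of the picked segment sizes, smallest first. */
    public static List<Long> truncateSmallestFirst(List<Long> pickedSizes, int maxInputs) {
        if (maxInputs <= 0 || pickedSizes.size() <= maxInputs) {
            return pickedSizes; // cap disabled or pick already within the cap
        }
        List<Long> sorted = new ArrayList<>(pickedSizes);
        sorted.sort(Comparator.naturalOrder());
        return new ArrayList<>(sorted.subList(0, maxInputs));
    }
}
```

With the default base cap of 16, a 53-segment pick is cut to its 16 smallest segments, and a backlog of 300 segments raises the effective cap to 64 per cycle under these assumptions.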

eolivelli mentioned this pull request May 18, 2026

issue #587: cap input segments per vector-index compaction cycle eolivelli/herddb#589

Merged

eolivelli merged commit b9fbe52 into main May 18, 2026
4 of 10 checks passed

eolivelli merged 1 commit into main from issue-587-pq-retrain-parallel-io
