Skip to content

issue #599: bypass block-cache for PQ retraining via local segment file download#601

Merged
eolivelli merged 2 commits into
masterfrom
issue-599-pqretrainer-sequential-streaming-remote
May 19, 2026
Merged

issue #599: bypass block-cache for PQ retraining via local segment file download#601
eolivelli merged 2 commits into
masterfrom
issue-599-pqretrainer-sequential-streaming-remote

Conversation

@eolivelli
Copy link
Copy Markdown
Owner

Fixes #599.

Changes

  • DeleteOnCloseReaderSupplier (new): wraps a ReaderSupplier and deletes the backing temp file on close(); ensures temp segment files downloaded for PQ retraining are cleaned up after use
  • SegmentPQReaderSupplier (new): builds the Function<OnDiskGraphIndex,ReaderSupplier> factory injected into OnDiskGraphIndexCompactor; for each source segment, either downloads the graph file directly from object storage (when DataStorageManager.supportsDirectMultipartDownload() is true) or opens a sequential reader via DataStorageManager.multipartIndexReaderSupplier(); temp files live in store.tmpDirectory() (= IS data directory)
  • VectorIndexCompactor: wire SegmentPQReaderSupplier.forSegments() into rebuildSegmentStreaming() via the new 6-arg OnDiskGraphIndexCompactor constructor (introduced in issue #599: add readerSupplierFactory to PQRetrainer / OnDiskGraphIndexCompactor jvector#14); add PQ_BULK_READER_COUNT AtomicInteger counter for test observability
  • PersistentVectorStore: expose segmentStorageKey(), tableSpaceUUID(), and dataStorageManager() as package-private accessors needed by SegmentPQReaderSupplier

Tests

  • VectorIndexStreamingCompactionTest#streamingCompactionUsesBulkPQReaderSupplier: creates a store with MemoryDataStorageManager (dim=16, 300 vectors × 5 checkpoints so FusedPQ is enabled), calls store.runCompactionCycle(), and asserts that PQ_BULK_READER_COUNT increased by ≥ 2 (one per source segment) and that a search on the resulting index returns non-empty results

Depends on

eolivelli/jvector#14 — the 6-arg OnDiskGraphIndexCompactor constructor and OnDiskGraphIndex.getView(ReaderSupplier) overload

🤖 Implemented by the pr-worker agent.

eolivelli and others added 2 commits May 19, 2026 14:54
…nt files locally

During streaming compaction, PQRetrainer.extractVectorsSequential() called
getVectorInto() once per sampled node, each of which issued a gRPC
block-cache read over remote storage. For 13 source segments × ~4 096
samples each, this serialized ~53 248 network round-trips.

Option B: download each source graph file once to a temp file in the IS
data directory (store.tmpDirectory()), then open all PQ-retraining Views
from that local file. When the DSM supports direct object-storage download
(supportsDirectMultipartDownload=true), the temp file is written without
going through the gRPC file server at all; otherwise, a single sequential
DSM read buffers the entire file for the retrainer.

New classes:
- DeleteOnCloseReaderSupplier: wraps a ReaderSupplier and deletes the
  backing temp file on close()
- SegmentPQReaderSupplier: builds the Function<OnDiskGraphIndex,ReaderSupplier>
  factory used by OnDiskGraphIndexCompactor (via the new 6-arg constructor
  introduced in eolivelli/jvector #14)

Changes to existing classes:
- VectorIndexCompactor: wire SegmentPQReaderSupplier.forSegments() into
  rebuildSegmentStreaming(); add PQ_BULK_READER_COUNT counter for test
  observability
- PersistentVectorStore: expose segmentStorageKey(), tableSpaceUUID(), and
  dataStorageManager() as package-private accessors for SegmentPQReaderSupplier

Tests:
- VectorIndexStreamingCompactionTest#streamingCompactionUsesBulkPQReaderSupplier:
  exercises the full runCompactionCycle() → rebuildSegmentStreaming() →
  PQRetrainer path with MemoryDataStorageManager and verifies that the
  bulk reader factory is invoked at least twice (once per source segment)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SegmentPQReaderSupplier: add candidates/sources size mismatch precondition
  check (IllegalArgumentException) and graphFileSize <= 0 guard
  (IllegalStateException); remove redundant deleteSilently from catch blocks
  (only in finally); replace bare RuntimeException with IllegalStateException
- VectorIndexCompactor: move PQ_BULK_READER_COUNT increment to after
  pqReaderFactory.apply() returns successfully; use short AtomicInteger name
- SegmentPQReaderSupplierTest: new test class covering sizeMismatchRejected,
  zeroGraphFileSizeRejected, directDownloadFastPathSuccessCleansTempFile,
  and directDownloadFastPathFailureCleansTempFile — the last two use a
  DirectDownloadDsm that returns supportsDirectMultipartDownload()=true and
  only injects failures for fileType="graph" (PQ path), not "map" (checkpoint)
- VectorIndexStreamingCompactionTest: update Javadoc for
  streamingCompactionUsesBulkPQReaderSupplier to reflect that the counter
  reflects successful reader acquisitions, not just factory invocations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eolivelli eolivelli merged commit 4062e02 into master May 19, 2026
19 of 23 checks passed
@eolivelli eolivelli deleted the issue-599-pqretrainer-sequential-streaming-remote branch May 19, 2026 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PQRetrainer.extractVectorsSequential: use large sequential buffer + cache bypass instead of 16 KiB block-cache reads

1 participant