Follow-up to #616 (items 1 and 2). The Phase B checkpoint guard from #616 landed in #618; this issue tracks the remaining compaction-observability improvements that #618 deliberately deferred. Item 3 of #616 (automatic exclusion of READ_IO-failed segments + describe-index exposure) is out of scope here.
Background
During a 100M BIGANN vector benchmark in gRPC push mode, the IS accumulated 594 segments and the compaction cycle began retrying every 10 minutes against a corrupt segment without operators being able to identify which segment was failing from the logs.
The compaction cycle today logs only aggregate counts:
INFO: vector store vidx: tiered compaction — 594 segments → effective maxBytes=…
INFO: vector store vidx: maxInputBytes cap (4,294,967,296 bytes) trimmed candidates to 83 segment(s) (4,269,873,288 bytes)
INFO: vector store vidx: starting graph-merge compaction (83 candidate segments)
WARNING: vector store vidx: compaction failed (READ_IO)
…streaming compaction: failed to prepare source segment graph for vidx_vector_bench_603989658950146_seg627
There is no per-segment inventory, no list of the 83 chosen candidates, and no per-segment selection reason. When a corrupt segment is in the selected set, there is no log trail to explain why it was included.
Requested improvements
1. Segment inventory log at compaction start
Before running the selection algorithm in PersistentVectorStore.runCompactionCycle, log:
INFO summary: vector store <indexName>: segment inventory — <N> segments, <totalLiveVectors> vectors total
FINE per-segment (gated on LOGGER.isLoggable(Level.FINE) so the per-segment loop has near-zero cost in production): one line per segment with segUuid, liveCount, estimatedSizeBytes, graphFileSize, generation. (No proactive S3 HEAD check: the corrupted-segment quarantine in item 3 of #616, when implemented, will break the retry loop without extra S3 traffic. A HEAD-based minioOk flag can be added later if operators want it.)
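The two log levels described above could be sketched roughly as follows. This is a minimal illustration, not the actual PersistentVectorStore code: the SegmentInfo record, the helper names, and the exact per-segment line format are hypothetical stand-ins, and only the INFO summary format is taken from this issue. It assumes Java 16+ for records.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class SegmentInventoryLogSketch {

    /** Hypothetical stand-in for the per-segment fields listed in the issue. */
    record SegmentInfo(String segUuid, long liveCount, long estimatedSizeBytes,
                       long graphFileSize, long generation) {}

    static final Logger LOGGER = Logger.getLogger(SegmentInventoryLogSketch.class.getName());

    /** Builds the INFO summary line in the format proposed in the issue. */
    static String inventorySummary(String indexName, List<SegmentInfo> segments) {
        long totalLiveVectors = 0;
        for (SegmentInfo s : segments) {
            totalLiveVectors += s.liveCount();
        }
        // The unicode escape is the dash used in the issue's message format.
        return "vector store " + indexName + ": segment inventory \u2014 "
                + segments.size() + " segments, " + totalLiveVectors + " vectors total";
    }

    /** Builds one FINE line; this key=value layout is an assumption, not a spec. */
    static String segmentLine(String indexName, SegmentInfo s) {
        return "vector store " + indexName + ": segment " + s.segUuid()
                + " liveCount=" + s.liveCount()
                + " estimatedSizeBytes=" + s.estimatedSizeBytes()
                + " graphFileSize=" + s.graphFileSize()
                + " generation=" + s.generation();
    }

    static void logInventory(String indexName, List<SegmentInfo> segments) {
        LOGGER.info(inventorySummary(indexName, segments));
        // Guard the whole loop so no per-segment strings are built unless FINE is enabled.
        if (LOGGER.isLoggable(Level.FINE)) {
            for (SegmentInfo s : segments) {
                LOGGER.fine(segmentLine(indexName, s));
            }
        }
    }

    public static void main(String[] args) {
        logInventory("vidx", List.of(
                new SegmentInfo("seg-a", 1000, 4096, 2048, 1),
                new SegmentInfo("seg-b", 500, 2048, 1024, 2)));
    }
}
```

With the default JUL configuration only the INFO summary is printed; the per-segment lines appear once the logger level is lowered to FINE.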
2. Candidate selection trace log
After VectorIndexCompactor.chooseSegmentsToMerge returns and after the maxInputBytes cap has trimmed the list, log the chosen candidates:
INFO summary: already present as starting graph-merge compaction (<N> candidate segments); extend it slightly to include the total bytes selected.
FINE per-candidate: one line per chosen candidate with segUuid, liveCount, graphFileSize, selection-reason tag (tier-0, micro-segment, byte-cap-trimmed, etc.).
This makes it possible to grep the IS log stream for which segments were picked in each cycle, and to verify the selection policy from production logs.
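A rough sketch of the candidate trace, under the same caveats: the Candidate record, the helper names, and both message layouts are hypothetical, and the reason tag is assumed to be supplied by the selection code. The total byte count is passed in rather than recomputed, on the assumption that the maxInputBytes cap logic already has it at hand.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CandidateTraceLogSketch {

    /** Hypothetical view of one chosen candidate, including its selection-reason tag. */
    record Candidate(String segUuid, long liveCount, long graphFileSize, String reason) {}

    static final Logger LOGGER = Logger.getLogger(CandidateTraceLogSketch.class.getName());

    /** Existing INFO summary extended with the total bytes selected (layout assumed). */
    static String summary(String indexName, int candidateCount, long totalBytesSelected) {
        return "vector store " + indexName + ": starting graph-merge compaction ("
                + candidateCount + " candidate segments, " + totalBytesSelected + " bytes)";
    }

    /** One FINE line per candidate; the key=value layout is an assumption. */
    static String candidateLine(String indexName, Candidate c) {
        return "vector store " + indexName + ": candidate " + c.segUuid()
                + " liveCount=" + c.liveCount()
                + " graphFileSize=" + c.graphFileSize()
                + " reason=" + c.reason();
    }

    static void logSelection(String indexName, List<Candidate> chosen, long totalBytesSelected) {
        LOGGER.info(summary(indexName, chosen.size(), totalBytesSelected));
        // Skip the per-candidate loop entirely unless FINE is enabled.
        if (!LOGGER.isLoggable(Level.FINE)) {
            return;
        }
        for (Candidate c : chosen) {
            LOGGER.fine(candidateLine(indexName, c));
        }
    }

    public static void main(String[] args) {
        logSelection("vidx", List.of(
                new Candidate("seg-a", 1000, 2048, "tier-0"),
                new Candidate("seg-b", 12, 256, "micro-segment")), 2304L);
    }
}
```

Keeping the segUuid and the reason tag on the same line means a single grep for a segment identifier shows every cycle in which it was selected and why.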
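Out of scope
Item 3 of #616: automatic exclusion of READ_IO-failed segments + describe-index exposure. Not pursued.
Proactive S3 HEAD checks of segment graph files. The retry-loop symptom from the original report is best addressed by quarantine (item 3, out of scope) rather than per-cycle HEAD-request fan-out.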
Implementation notes
All changes live in herddb-indexing-service/src/main/java/herddb/indexing/vector/PersistentVectorStore.java, around runCompactionCycle (lines ~2431–2570 in the current master).
The per-segment / per-candidate loops MUST be guarded by LOGGER.isLoggable(Level.FINE): at 594 segments per cycle, building the messages unguarded would burn measurable CPU even when the handler discards the records.
The segUuid is computed by segmentStorageKey(VectorSegment) (line 1147) — reuse, do not duplicate the formatting.
Tests
A plain unit test (CompactionInventoryLoggingTest in herddb-indexing-service/src/test/java/herddb/indexing/vector/) that:
Spins up a PersistentVectorStore with MemoryDataStorageManager.
Writes a few real segments via addVector + checkpoint.
Attaches a JUL handler at Level.ALL to PersistentVectorStore.LOGGER.
Calls runCompactionCycle().
Asserts the INFO inventory line is emitted with the right segment count, and that the FINE per-segment / per-candidate lines name the expected segUuids.
No new @Category(ClusterTest.class) required — pure unit-test path.
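The log-capture part of such a test might look like the sketch below. It does not touch PersistentVectorStore; CapturingHandler and capture are hypothetical helpers, and the logger passed in stands in for PersistentVectorStore.LOGGER. One detail worth noting: in java.util.logging the logger's own level must also admit FINE, otherwise isLoggable(Level.FINE) returns false and the guarded loops never run, regardless of the handler's level.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class CapturingHandlerSketch {

    /** Collects every published record in memory so a test can assert on it. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Runs the action with capture enabled and returns "LEVEL message" strings. */
    static List<String> capture(Logger logger, Runnable action) {
        CapturingHandler handler = new CapturingHandler();
        handler.setLevel(Level.ALL);
        Level previous = logger.getLevel();
        // Lower the logger level too, so FINE records pass the isLoggable guard.
        logger.setLevel(Level.ALL);
        logger.addHandler(handler);
        try {
            action.run();
        } finally {
            // Restore the logger so other tests see the original configuration.
            logger.removeHandler(handler);
            logger.setLevel(previous);
        }
        List<String> messages = new ArrayList<>();
        for (LogRecord r : handler.records) {
            messages.add(r.getLevel().getName() + " " + r.getMessage());
        }
        return messages;
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("sketch.capture.demo");
        List<String> messages = capture(logger, () -> {
            logger.info("segment inventory");
            if (logger.isLoggable(Level.FINE)) {
                logger.fine("segment seg-a");
            }
        });
        System.out.println(messages);
    }
}
```

In the real test the action would be the call to runCompactionCycle(), and the assertions would check the INFO segment count and the segUuids named in the FINE lines.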