Summary
CPU profiling of a graph-index build/compaction workload shows AbstractLongHeap.upHeap consuming ~43% of CPU — more than the actual vector distance computations (squareDistance, ~32%).
| Method |
Self CPU |
AbstractLongHeap.upHeap |
43.1% |
PanamaVectorUtilSupport.squareDistancePreferred (real vector math) |
32.3% |
IntHashSet.add (visited set) |
8.5% |
Call path: GraphIndexBuilder.addGraphNode → GraphSearcher.resume → searchLayer0 → searchOneLayer → neighbor callback → NodeQueue.push → GrowableLongHeap.push → AbstractLongHeap.add → upHeap. All upHeap samples land on the sift-up loop.
Root cause
GraphSearcher.candidates is an unbounded GrowableLongHeap (MAX-heap). The neighbor callback in searchOneLayer pushes every scored neighbor onto it unconditionally — roughly maxDegree (~32) pushes per expanded node against a single pop. Within one search the heap grows to many thousands of entries (an 80 KB–800 KB long[]). Each push then performs an O(log N) sift-up that walks parent indices i, i/2, i/4 … — addresses scattered far apart in an array much larger than CPU cache, so the loop becomes cache-miss bound.
NodeQueue / AbstractLongHeap.upHeap themselves are fine — the sift-up is already an optimal implementation. The problem is the size of the heap, driven by the caller.
Why most of those pushes are wasted work
stopSearch ends the loop once the result set is full (approximateResults.size() >= rerankK) and the best remaining candidate is worse than the worst kept result (topCandidateScore < approximateResults.topScore()).
A candidate pushed with score < approximateResults.topScore() can only reach the top of the MAX-heap after every better candidate — by which point stopSearch has already fired. Since approximateResults.topScore() (the worst kept result) only ever rises, such a candidate is provably unexpandable for the rest of the search. Pushing it is pure heap bloat that slows every subsequent upHeap.
This is the standard HNSW frontier-pruning step, which jvector currently omits.
Proposed fix
Divert provably-unexpandable neighbors away from the hot candidates heap into the existing evictedResults buffer (a NodesUnsorted with O(1) append and no sift-up), which is already drained back into candidates on resume() / layer descent. This keeps resume() bit-exact while bounding the candidates heap to ~rerankK + maxDegree for non-resumed searches and graph builds.
Summary
CPU profiling of a graph-index build/compaction workload shows
AbstractLongHeap.upHeapconsuming ~43% of CPU — more than the actual vector distance computations (squareDistance, ~32%).AbstractLongHeap.upHeapPanamaVectorUtilSupport.squareDistancePreferred(real vector math)IntHashSet.add(visited set)Call path:
GraphIndexBuilder.addGraphNode→GraphSearcher.resume→searchLayer0→searchOneLayer→ neighbor callback →NodeQueue.push→GrowableLongHeap.push→AbstractLongHeap.add→upHeap. AllupHeapsamples land on the sift-up loop.Root cause
GraphSearcher.candidatesis an unboundedGrowableLongHeap(MAX-heap). The neighbor callback insearchOneLayerpushes every scored neighbor onto it unconditionally — roughlymaxDegree(~32) pushes per expanded node against a singlepop. Within one search the heap grows to many thousands of entries (an 80 KB–800 KBlong[]). Eachpushthen performs an O(log N) sift-up that walks parent indicesi, i/2, i/4 …— addresses scattered far apart in an array much larger than CPU cache, so the loop becomes cache-miss bound.NodeQueue/AbstractLongHeap.upHeapthemselves are fine — the sift-up is already an optimal implementation. The problem is the size of the heap, driven by the caller.Why most of those pushes are wasted work
stopSearchends the loop once the result set is full (approximateResults.size() >= rerankK) and the best remaining candidate is worse than the worst kept result (topCandidateScore < approximateResults.topScore()).A candidate pushed with
score < approximateResults.topScore()can only reach the top of the MAX-heap after every better candidate — by which pointstopSearchhas already fired. SinceapproximateResults.topScore()(the worst kept result) only ever rises, such a candidate is provably unexpandable for the rest of the search. Pushing it is pure heap bloat that slows every subsequentupHeap.This is the standard HNSW frontier-pruning step, which jvector currently omits.
Proposed fix
Divert provably-unexpandable neighbors away from the hot
candidatesheap into the existingevictedResultsbuffer (aNodesUnsortedwith O(1) append and no sift-up), which is already drained back intocandidatesonresume()/ layer descent. This keepsresume()bit-exact while bounding thecandidatesheap to ~rerankK + maxDegreefor non-resumed searches and graph builds.