Commit ecbb1bc

fh-ms and zdenek-jonas authored
GigaMap concurrency handling improvements and eventual indexing (#564)
* Bump jvector version to 4.0.0-rc.8 in pom.xml
* Add parallel on-disk write support to VectorIndex
* Introduce BackgroundIndexingManager for eventual consistency in VectorIndex
  - Added `BackgroundIndexingManager` interface to manage background graph indexing.
  - Implemented a default asynchronous queue-based manager for deferred HNSW operations.
  - Added unit tests to validate eventual indexing behavior across add, update, remove, and bulk operations.
* Add concurrent stress tests for VectorIndex thread safety
  - Introduced stress test cases to validate VectorIndex under heavy concurrent operations.
  - Tests cover various configurations, including in-memory and on-disk setups with/without PQ compression and background tasks.
  - Ensured thread safety via assertions for no exceptions or deadlocks.
  - Included targeted eventual indexing and heavy load scenarios for robustness.
* Add performance tests for synchronous vs eventual indexing in VectorIndex
  - Introduced test cases to evaluate insertion performance (single and batch adds) between synchronous and eventual indexing modes.
  - Verified search quality to ensure correctness of deferred graph indexing.
  - Included detailed metrics on caller-visible speedup and indexing throughput.
* Refactor BackgroundIndexingManager to delegate operation handling via execute()
  - Replaced the applyOperation method with per-operation execute() implementations in IndexingOperation types.
  - Simplified indexing logic by encapsulating operation-specific behavior within each record.
  - Improved maintainability and reduced duplication in BackgroundIndexingManager.
* Update default value for parallel on-disk write to false in VectorIndexConfiguration
* Update tests for parallel on-disk write default change to false in VectorIndexConfiguration
* Refactor VectorIndex synchronization and builder operation handling
  - Replaced `persistenceLock` with `builderLock` for unified read/write access control over builder operations.
  - Introduced deferred operation handling for sync-mode mutations during cleanup phases.
  - Improved thread-safety for concurrent graph updates by coordinating via builder read/write locks.
  - Added `cleanupInProgress` flag and `deferredBuilderOps` queue to manage in-flight operations during cleanup and persistence tasks.
  - Removed redundant `synchronized(parentMap)` calls to avoid lock-ordering issues.
* Documented `eventualIndexing` and `parallelOnDiskWrite` features in VectorIndexConfiguration with examples and configuration details.
* Document eventual indexing and parallel on-disk write features with examples and configuration guidance in VectorIndex JavaDoc.
* Remove redundant default case check for unsupported similarity function in VectorIndex and add comment
* Add braces to conditional statements in VectorEntry equals() for consistency
* Consolidate BackgroundIndexingManager, BackgroundOptimizationManager, and BackgroundPersistenceManager into unified BackgroundTaskManager.
* Gigamap jvector update tests update (#569)
  * move configuration tests to VICT file, remove duplicate tests.
  * remove test, duplicate of testBackgroundOptimizationTriggersAfterIntervalAndThreshold
  * test refactoring
  * change indents - to see diff in PR
  * reverts original formats
  * add vector indices unit tests
  * remove duplicate tests

---------

Co-authored-by: Zdenek Jonas <z.jonas@microstream.one>
1 parent 36a3d77 commit ecbb1bc
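
The `execute()` refactoring named in the commit message (operation-specific behavior moved into each `IndexingOperation` record) can be pictured with a minimal sketch. This is not the code from the commit: `IndexingOperationSketch`, `Graph`, the `Add`/`Update`/`Remove` records, their fields, and `applyAll` are all hypothetical names, and a sorted map stands in for the HNSW graph.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexingOperationSketch {

    /** Hypothetical stand-in for the HNSW graph; it only tracks id -> payload. */
    static final class Graph {
        final Map<Long, String> nodes = new TreeMap<>();
    }

    /** Each queued operation knows how to apply itself to the graph. */
    interface IndexingOperation {
        void execute(Graph graph);
    }

    record Add(long id, String payload) implements IndexingOperation {
        @Override
        public void execute(Graph graph) {
            graph.nodes.put(id, payload);
        }
    }

    record Update(long id, String payload) implements IndexingOperation {
        @Override
        public void execute(Graph graph) {
            // Only replaces an entry that is already present.
            graph.nodes.replace(id, payload);
        }
    }

    record Remove(long id) implements IndexingOperation {
        @Override
        public void execute(Graph graph) {
            graph.nodes.remove(id);
        }
    }

    /** The manager loop never inspects operation types; it only calls execute(). */
    static void applyAll(Graph graph, List<IndexingOperation> queue) {
        for (IndexingOperation operation : queue) {
            operation.execute(graph);
        }
    }

    public static void main(String[] args) {
        Graph graph = new Graph();
        applyAll(graph, List.of(
                new Add(1L, "a"),
                new Add(2L, "b"),
                new Update(1L, "a2"),
                new Remove(2L)));
        System.out.println(graph.nodes); // prints {1=a2}
    }
}
```

With this shape, adding a new operation type means adding one record rather than extending a central dispatch method, which is presumably the duplication the commit message refers to.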

16 files changed

Lines changed: 4557 additions & 1271 deletions

docs/modules/gigamap/pages/indexing/jvector/configuration.adoc

Lines changed: 53 additions & 0 deletions
@@ -141,6 +141,10 @@ For datasets that exceed available memory, enable on-disk storage to use memory-
 |`pqSubspaces`
 |`0`
 |Number of PQ subspaces (0 = auto: dimension/4).
+
+|`parallelOnDiskWrite`
+|`false`
+|Use parallel direct buffers and multiple worker threads for on-disk index writing. Speeds up persistence for large indices but uses more resources. Only applies when `onDisk=true`.
 |===

 === Example
@@ -157,6 +161,55 @@ VectorIndexConfiguration config = VectorIndexConfiguration.builder()
     .build();
 ----

+== Eventual Indexing
+
+Enable eventual indexing to defer expensive HNSW graph mutations to a background thread, reducing mutation latency at the cost of eventual search consistency.
+
+[options="header",cols="1,1,3"]
+|===
+|Parameter |Default |Description
+
+|`eventualIndexing`
+|`false`
+|Defer HNSW graph mutations (add, update, remove) to a background thread. The vector store is updated synchronously, but graph construction happens asynchronously. Search results may not immediately reflect the most recent mutations.
+|===
+
+When enabled:
+
+* The vector store is always updated synchronously (no data loss).
+* HNSW graph mutations are queued and applied by a single background worker thread.
+* The queue is automatically drained before `optimize()`, `persistToDisk()`, and `close()`.
+
+=== Example
+
+[source, java]
+----
+VectorIndexConfiguration config = VectorIndexConfiguration.builder()
+    .dimension(768)
+    .similarityFunction(VectorSimilarityFunction.COSINE)
+    .eventualIndexing(true)
+    .build();
+----
+
+== Parallel On-Disk Writes
+
+When on-disk storage is enabled, persistence can optionally use parallel direct buffers and multiple worker threads (one per available processor) to write the index concurrently. This can significantly speed up persistence for large indices.
+
+This is disabled by default, as sequential single-threaded writing is preferred in resource-constrained environments or for smaller indices.
+
+=== Example
+
+[source, java]
+----
+VectorIndexConfiguration config = VectorIndexConfiguration.builder()
+    .dimension(768)
+    .similarityFunction(VectorSimilarityFunction.COSINE)
+    .onDisk(true)
+    .indexDirectory(Path.of("/data/vectors"))
+    .parallelOnDiskWrite(true)
+    .build();
+----
+
 == Background Persistence

 Enable automatic asynchronous persistence to avoid blocking operations during writes.
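
The parallel on-disk write described in the documentation changes above (per-worker direct buffers, one worker per available processor) can be illustrated with a minimal sketch. It is not the writer used by the commit or by JVector: `ParallelWriteSketch`, `writeVectors`, `writeSlice`, `readVectors`, and the flat fixed-size record layout are assumptions made for illustration. Each worker fills its own direct `ByteBuffer` and uses positional `FileChannel` writes, so workers never share buffer state or the channel's position.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelWriteSketch {

    /** Writes fixed-size float records; every worker fills one contiguous slice of the file. */
    static void writeVectors(Path file, float[][] vectors, int dimension, int workers) {
        int recordBytes = dimension * Float.BYTES;
        int perWorker = Math.max(1, (vectors.length + workers - 1) / workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            List<Future<Long>> pending = new ArrayList<>();
            for (int from = 0; from < vectors.length; from += perWorker) {
                int start = from;
                int end = Math.min(vectors.length, from + perWorker);
                Callable<Long> task = () -> writeSlice(channel, vectors, start, end, recordBytes);
                pending.add(pool.submit(task));
            }
            for (Future<Long> slice : pending) {
                slice.get(); // wait for every slice and surface worker failures
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    private static long writeSlice(FileChannel channel, float[][] vectors,
                                   int from, int to, int recordBytes) throws IOException {
        // One direct buffer per worker, so no buffer state is shared across threads.
        ByteBuffer buffer = ByteBuffer.allocateDirect(recordBytes);
        long written = 0;
        for (int i = from; i < to; i++) {
            buffer.clear();
            for (float component : vectors[i]) {
                buffer.putFloat(component);
            }
            buffer.flip();
            long position = (long) i * recordBytes;
            while (buffer.hasRemaining()) {
                // Positional writes leave the channel's own position untouched.
                int count = channel.write(buffer, position);
                position += count;
                written += count;
            }
        }
        return written;
    }

    /** Reads the records back sequentially; used here only to check the round trip. */
    static float[][] readVectors(Path file, int dimension) {
        try {
            ByteBuffer data = ByteBuffer.wrap(Files.readAllBytes(file));
            float[][] vectors = new float[data.remaining() / (dimension * Float.BYTES)][dimension];
            for (float[] vector : vectors) {
                for (int d = 0; d < dimension; d++) {
                    vector[d] = data.getFloat();
                }
            }
            return vectors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        float[][] vectors = new float[1000][8];
        for (int i = 0; i < vectors.length; i++) {
            vectors[i][0] = i;
        }
        Path file = Files.createTempFile("vectors", ".bin");
        writeVectors(file, vectors, 8, Runtime.getRuntime().availableProcessors());
        System.out.println(Files.size(file) + " bytes written");
        Files.delete(file);
    }
}
```

The sketch also shows why the option costs more resources: every worker holds its own off-heap buffer and thread for the duration of the write, which is consistent with the documented default of `false`.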

gigamap/jvector/README.md

Lines changed: 41 additions & 0 deletions
@@ -10,6 +10,8 @@ A Java library that integrates [JVector](https://github.com/datastax/jvector) (h
 - **PQ Compression**: Product Quantization for reduced memory footprint
 - **Background Persistence**: Automatic asynchronous persistence at configurable intervals
 - **Background Optimization**: Periodic graph cleanup for improved query performance
+- **Eventual Indexing**: Deferred graph mutations via background thread for reduced write latency
+- **Parallel On-Disk Writes**: Multi-threaded index persistence for large on-disk indices
 - **Lazy Entity Access**: Search results provide direct access to entities without additional lookups
 - **Stream API**: Java Stream support for search results
 - **GigaMap Integration**: Seamlessly integrates with GigaMap's index system
@@ -163,6 +165,13 @@ List<Document> topDocs = result.stream()
 | `indexDirectory` | `null` | Directory for index files (required if `onDisk=true`) |
 | `enablePqCompression` | `false` | Enable Product Quantization compression |
 | `pqSubspaces` | `0` | Number of PQ subspaces (0 = auto: dimension/4) |
+| `parallelOnDiskWrite` | `false` | Use parallel direct buffers and multiple worker threads for on-disk index writing. Speeds up persistence for large indices but uses more resources. Only applies when `onDisk=true` |
+
+### Eventual Indexing
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `eventualIndexing` | `false` | Defer HNSW graph mutations to a background thread. The vector store is updated synchronously, but graph construction happens asynchronously. Reduces mutation latency at the cost of eventual search consistency |

 ### Background Persistence

@@ -223,6 +232,38 @@ VectorIndexConfiguration config = VectorIndexConfiguration.builder()
     .build();
 ```

+### Eventual Indexing
+
+For high-throughput systems where mutation latency matters more than immediate search consistency:
+
+```java
+VectorIndexConfiguration config = VectorIndexConfiguration.builder()
+    .dimension(768)
+    .similarityFunction(VectorSimilarityFunction.COSINE)
+    // Eventual indexing (graph mutations deferred to background thread)
+    .eventualIndexing(true)
+    .build();
+```
+
+When enabled, the vector store is always updated synchronously (no data loss), but expensive HNSW graph mutations are queued and applied by a background worker thread. Search results may not immediately reflect the most recent mutations. The queue is automatically drained before `optimize()`, `persistToDisk()`, and `close()`.
+
+### Parallel On-Disk Writes
+
+For large on-disk indices where persistence speed is critical:
+
+```java
+VectorIndexConfiguration config = VectorIndexConfiguration.builder()
+    .dimension(768)
+    .similarityFunction(VectorSimilarityFunction.COSINE)
+    .onDisk(true)
+    .indexDirectory(Path.of("/data/vectors"))
+    // Parallel on-disk writing (multiple worker threads)
+    .parallelOnDiskWrite(true)
+    .build();
+```
+
+When enabled, the on-disk graph writer uses parallel direct buffers and multiple worker threads (one per available processor) to write the index concurrently. This is disabled by default as sequential writing is preferred in resource-constrained environments or for smaller indices.
+
 ### Manual Optimization and Persistence

 ```java
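
The eventual-indexing contract described in the README changes above (synchronous vector store, graph mutations queued for a single background worker, queue drained before `optimize()`, `persistToDisk()`, and `close()`) can be illustrated with a minimal sketch. It is not code from this commit and not the Eclipse Store API: `EventualIndexSketch` and all of its members are hypothetical, and a plain concurrent set stands in for the HNSW graph.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class EventualIndexSketch implements AutoCloseable {

    // Vector store: written on the caller's thread, so data is never "pending".
    private final Map<Long, float[]> vectorStore = new ConcurrentHashMap<>();
    // Stand-in for the HNSW graph: mutated only by the background worker.
    private final Set<Long> graphNodes = ConcurrentHashMap.newKeySet();
    // A single worker thread applies queued graph mutations in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(task -> {
        Thread thread = new Thread(task, "graph-indexer");
        thread.setDaemon(true);
        return thread;
    });

    void add(long id, float[] vector) {
        vectorStore.put(id, vector.clone());        // synchronous store update
        worker.execute(() -> graphNodes.add(id));   // deferred graph mutation
    }

    void remove(long id) {
        vectorStore.remove(id);
        worker.execute(() -> graphNodes.remove(id));
    }

    /** Blocks until every mutation queued before this call has been applied. */
    void drain() {
        Runnable marker = () -> { };
        try {
            worker.submit(marker).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Drains first, so cleanup would see a graph that matches the store. */
    void optimize() {
        drain();
        // Graph cleanup would happen here in a real index.
    }

    boolean isStored(long id) {
        return vectorStore.containsKey(id);
    }

    boolean isIndexed(long id) {
        return graphNodes.contains(id);
    }

    int indexedCount() {
        return graphNodes.size();
    }

    @Override
    public void close() {
        drain();
        worker.shutdown();
    }

    public static void main(String[] args) {
        try (EventualIndexSketch index = new EventualIndexSketch()) {
            index.add(1L, new float[] {0.1f, 0.2f});
            System.out.println("stored:  " + index.isStored(1L));  // true right after add
            index.optimize();
            System.out.println("indexed: " + index.isIndexed(1L)); // true once drained
        }
    }
}
```

Between `add` and the next drain, `isIndexed` may return either value. That window is the "eventual search consistency" the README trades for lower mutation latency.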

gigamap/jvector/pom.xml

Lines changed: 7 additions & 1 deletion
@@ -19,7 +19,7 @@
     <url>https://projects.eclipse.org/projects/technology.store</url>

     <properties>
-        <jvector.version>4.0.0-rc.7</jvector.version>
+        <jvector.version>4.0.0-rc.8</jvector.version>
     </properties>

     <dependencies>
@@ -44,6 +44,12 @@
             <artifactId>junit-jupiter-engine</artifactId>
             <scope>test</scope>
         </dependency>
+        <dependency>
+            <groupId>org.awaitility</groupId>
+            <artifactId>awaitility</artifactId>
+            <version>4.2.2</version>
+            <scope>test</scope>
+        </dependency>
     </dependencies>

     <build>

gigamap/jvector/src/main/java/org/eclipse/store/gigamap/jvector/BackgroundOptimizationManager.java

Lines changed: 0 additions & 239 deletions
This file was deleted.
