feat(vsjoin): implement VSJoin dual-index architecture (v0.1.3.1) #101
Merged
This commit introduces a comprehensive plan and task breakdown for the new VSJoin implementation.

- **Refine VSJoin Plan:**
  - The main design document is updated to unify naming (removing "v2").
  - Clarifies the replacement of v1 components with the new architecture (TwoTierWindowState + ConcurrencyManager).
  - Adds detailed sections on UID deduplication and the RCU-based load balancing mechanism (AssignmentTable).
- **Add Detailed Task Documents:**
  - Creates a new directory.
  - Adds markdown files for each implementation step (Task 01 to 09), providing clear instructions for development. These files are force-added as the parent directory is in .gitignore.
- **Cleanup Unused v1 Components:**
  - Deletes and as they are no longer needed in the new design.
This commit adds a new unit test for the VSJoin rebuild process, enhancing the test coverage for the VSJoin implementation. The test file is located at UnitTest/test_vsjoin_rebuild.cpp and is configured to run with a specified timeout of 300 seconds.
- Task 01: VSJoinMethod basic implementation
  - Implement ExecuteEager() with dual-layer query logic (Global + Local)
  - Add UID deduplication using local unordered_set
  - Implement setGlobalIndexIds/setLocalIndexIds/setWindowStates interfaces
- Task 02: JoinStrategyFactory integration
  - Add JoinAlgorithm::VSJOIN enum
  - Add vsjoin_* configuration parameters
  - Integrate VSJOIN case in factory (fallback to BruteForce for now)
- Task 03: JoinOperator VSJoin special path
  - Add vsjoin_local_*_ids_ and vsjoin_global_*_id_ members
  - Implement VSJoin-specific updateSideWithState() (insert to local index only)
  - Support LSH partitioner for VSJoin
- Task 04: Background rebuild mechanism
  - Implement globalIndexRebuildLoop() with periodic rebuild
  - Use std::call_once for thread-safe single startup
  - Implement local unordered_set deduplication (lock-free)
  - Add atomic index replacement via replace_index_by_id()

Integration tests passed:
- bruteforce: 7/7 tests, recall=1.000
- ivf: 7/7 tests, recall=0.999
- hdr_tree: 7/7 tests, recall=1.000
- clustered_join: 3/3 tests, recall=1.000
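The UID deduplication in Task 01 can be illustrated with a minimal sketch. It assumes that the global and local queries each return a list of candidates and that a candidate can appear in both. `Candidate` and `mergeDedup` are hypothetical names for illustration, not the actual VSJoinMethod interface.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical candidate type: only the UID matters for deduplication.
struct Candidate {
  std::uint64_t uid;
  float score;
};

// Merge the results of the global and local queries, keeping the first
// occurrence of each UID. The set lives on the caller's stack, so no
// lock is needed even when several threads run queries concurrently.
std::vector<Candidate> mergeDedup(const std::vector<Candidate>& global_hits,
                                  const std::vector<Candidate>& local_hits) {
  std::unordered_set<std::uint64_t> seen;
  seen.reserve(global_hits.size() + local_hits.size());
  std::vector<Candidate> merged;
  merged.reserve(global_hits.size() + local_hits.size());

  auto absorb = [&](const std::vector<Candidate>& hits) {
    for (const Candidate& c : hits) {
      // insert().second is true only the first time a UID is seen.
      if (seen.insert(c.uid).second) merged.push_back(c);
    }
  };
  absorb(global_hits);
  absorb(local_hits);
  return merged;
}
```

Because the set is a local variable rather than shared state, this form of deduplication needs no synchronization, which is presumably what "lock-free" refers to in Task 04.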
The snapshot ownership was released prematurely in the rebuild loop, causing the pointers stored in unique_left_records and unique_right_records to become dangling. This led to IVF centroids being initialized with garbage data (dimension=0), causing 'Vectors must be of the same size' errors during query_for_join.

Fix: Keep snapshot vectors alive until build_index_from_records completes by storing them in left_snapshots/right_snapshots containers.

Also includes:
- Improved test assertions for query record dimension
- Code cleanup for VSJoin factory integration
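The ownership rule behind this fix can be sketched as follows. The types and the `buildIndexDim` stand-in are assumptions made for illustration; only the names `left_snapshots` and `unique_left_records` come from the commit message, and the real `build_index_from_records` signature is not shown in this PR page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical record and snapshot types.
struct VectorRecord {
  std::uint64_t uid;
  std::vector<float> data;
};
using Snapshot = std::vector<std::unique_ptr<VectorRecord>>;

// Stand-in for the index build: returns the common dimension of the
// records, or 0 if the input is empty or the dimensions disagree.
std::size_t buildIndexDim(const std::vector<const VectorRecord*>& records) {
  if (records.empty()) return 0;
  const std::size_t dim = records.front()->data.size();
  for (const VectorRecord* r : records) {
    if (r->data.size() != dim) return 0;
  }
  return dim;
}

// One rebuild pass over the left side. The raw pointers in
// unique_left_records stay valid because left_snapshots owns the
// records until after the build call returns.
std::size_t rebuildOnce(std::vector<Snapshot> partition_snapshots) {
  std::vector<Snapshot> left_snapshots;  // owners, alive for the whole pass
  std::vector<const VectorRecord*> unique_left_records;
  for (Snapshot& snap : partition_snapshots) {
    for (const auto& rec : snap) unique_left_records.push_back(rec.get());
    // Moving the snapshot transfers ownership without relocating the records.
    left_snapshots.push_back(std::move(snap));
  }
  return buildIndexDim(unique_left_records);
}
```

If each snapshot were instead a loop-local variable destroyed at the end of its iteration, every pointer collected from it would dangle by the time the build runs, which matches the failure described above.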
Pull request overview
This PR implements the VSJoin dual-index architecture (v0.1.3.1) for real-time LLM context generation, featuring a two-tier index design with Global IVF indexes for fast approximate search and Local BruteForce indexes for partition-specific exact search. The implementation includes a critical bug fix for dangling pointer issues in the background rebuild mechanism.
Changes:
- Implements VSJoinMethod with dual-index query strategy (Global IVF + Local BruteForce)
- Adds JoinStrategyFactory integration creating 2 Global + 2×P Local indexes
- Implements background rebuild mechanism with proper snapshot ownership to fix dangling pointer bug
- Adds ConcurrencyManager support for atomic index replacement using RCU-style double-write
- Removes deprecated VSJoin v1 components (distance_verifier, async_candidate_generator)
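As a rough illustration of the atomic index replacement listed above, the sketch below swaps a `shared_ptr` under a `std::shared_mutex` so that readers holding the old index can finish while new readers see the rebuilt one. `IndexRegistry`, `Index`, and the method names are hypothetical; the double-write step mentioned in the PR is not modeled here.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <utility>

// Hypothetical index type; generation only makes swaps observable.
struct Index {
  int generation;
};

class IndexRegistry {
 public:
  void add(int id, std::shared_ptr<const Index> idx) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    table_[id] = std::move(idx);
  }

  // Readers hold the shared lock only long enough to copy the pointer.
  std::shared_ptr<const Index> acquire(int id) const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    auto it = table_.find(id);
    if (it == table_.end()) return nullptr;
    return it->second;
  }

  // Writer swaps in the rebuilt index; returns false for an unknown id.
  // The old index is destroyed once the last reader drops its pointer.
  bool replaceIndexById(int id, std::shared_ptr<const Index> fresh) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    auto it = table_.find(id);
    if (it == table_.end()) return false;
    it->second.swap(fresh);
    return true;
  }

 private:
  mutable std::shared_mutex mu_;
  std::unordered_map<int, std::shared_ptr<const Index>> table_;
};
```

A query thread would call `acquire(id)` once per query and use the returned pointer for the whole query, so a concurrent rebuild never changes the index underneath it.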
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/test_utils/join_config_loader.cpp | Removes deprecated vsjoin_async_threads and vsjoin_allowed_lateness fields |
| test/test_utils/integration_test_config.cpp | Removes deprecated VSJoin v1 field parsing |
| test/UnitTest/test_vsjoin_rebuild.cpp | Adds background rebuild thread lifecycle and deduplication tests |
| test/UnitTest/test_vsjoin_operator_path.cpp | Adds test for CentroidPartitioner multicast support |
| test/UnitTest/test_vsjoin_method.cpp | Adds VSJoinMethod integration tests for dual-index query |
| test/UnitTest/test_vsjoin_factory.cpp | Adds factory tests for index creation with parallelism |
| test/UnitTest/test_join_strategy_factory.cpp | Removes deprecated index_id checks for LSH strategy |
| test/UnitTest/test_distance_verifier.cpp | Disables all tests - component removed in Task01-04 |
| test/UnitTest/test_async_candidate_generator.cpp | Disables all tests - component removed in Task01-04 |
| test/IntegrationTest/test_pipeline_basic.cpp | Increases lock contention threshold and improves test stability |
| test/CMakeLists.txt | Adds test_vsjoin_rebuild to unit test suite |
| src/operator/utils/join_strategy_factory.cpp | Implements VSJoin dual-index creation logic |
| src/operator/utils/join_strategy_config.cpp | Updates VSJoin config field parsing |
| src/operator/utils/join_config_validator.cpp | Updates VSJoin validator for new WindowState types |
| src/operator/join_operator_methods/vsjoin_method.cpp | Complete rewrite - dual-index query implementation |
| src/operator/join_operator_methods/vsjoin_components/*.cpp | Deletes deprecated v1 components |
| src/operator/join_operator.cpp | Adds VSJoin special path + background rebuild loop |
| src/operator/CMakeLists.txt | Removes deprecated vsjoin_components sources |
| src/execution/input_gate.cpp | Adds stop() to wake blocked consumers |
| src/execution/execution_vertex.cpp | Adds stopAndWake() for proper thread termination |
| src/execution/execution_graph.cpp | Improves stop/join convergence with stopAndWake() |
| src/concurrency/concurrency_manager.cpp | Adds RCU-style index replacement with shared_mutex |
| src/concurrency/blank_controller.cpp | Implements double-write and replaceIndex for atomic swap |
| sage_flow/_version.py | Bumps version to 0.1.3.1 |
| pyproject.toml | Bumps version to 0.1.3.1 |
| include/operator/utils/join_strategy_factory.h | Adds dual-index fields to StrategyComponents |
| include/operator/utils/join_strategy_config.h | Adds new VSJoin config fields |
| include/operator/join_operator_methods/vsjoin_method.h | Complete rewrite - simplified dual-index interface |
| include/operator/join_operator_methods/vsjoin_components/*.h | Deletes deprecated v1 component headers |
| include/operator/join_operator.h | Adds VSJoin rebuild members and helper methods |
| include/execution/*.h | Adds stop() and stopAndWake() methods |
| include/concurrency/*.h | Adds RCU support with shared_mutex and double-write |
| docs/vsjoin_compliant_design_c745d987.plan.md | Updates design doc with implementation details |
| docs/tasks/vsjoin/*.md | Adds 9 task documentation files |
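The `stop()` change to the input gate in the table above follows a common shutdown pattern, sketched here under assumptions: a consumer blocked on a condition variable is released when a stop flag is set. `StoppableQueue` is a hypothetical stand-in and does not reflect the actual InputGate class.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

class StoppableQueue {
 public:
  void push(int value) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(value);
    }
    cv_.notify_one();
  }

  // Blocks until an item is available or stop() has been called.
  // Returns std::nullopt only when stopped and fully drained.
  std::optional<int> pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return stopped_ || !queue_.empty(); });
    if (queue_.empty()) return std::nullopt;
    int value = queue_.front();
    queue_.pop_front();
    return value;
  }

  // Sets the flag under the lock, then wakes every blocked consumer so
  // each one can observe the flag and return.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<int> queue_;
  bool stopped_ = false;
};
```

Without the `notify_all()` in `stop()`, a consumer thread waiting on an empty queue would never wake up, and joining it during shutdown would hang.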
The inferDefaults() for LSH algorithm returns PARTITIONED, not PARTITIONED_VECTOR. PARTITIONED_VECTOR is only used for VSJOIN algorithm.
sed -n '168,195p' /root/sageFlow/src/operator/utils/join_strategy_factory.cpp

1. Configuration validation ✅
   - integration_test_cases.toml contains 4 enabled VSJoin test cases
   - The configuration includes the required parameters (vsjoin_num_hash_functions, vsjoin_boundary_threshold, etc.)
   - The num_partitions parameter is set to reasonable values
2. End-to-end path validation ✅
   - JoinStrategyFactory::create() correctly creates VSJoinMethod
   - TwoTierWindowState is initialized correctly
   - Global/Local indexes are created and managed correctly
   - The background rebuild thread works as expected
3. Test execution validation ✅
   - test_join_baseline_integration --gtest_filter='*vsjoin*' runs successfully
   - run_integration_test.py --methods vsjoin runs successfully
   - Test reports are generated correctly
4. Recall validation ✅
   - vsjoin_baseline: Recall=1.0 (expected >= 0.70)
   - vsjoin_high_recall: Recall=1.0 (expected >= 0.75)
   - vsjoin_parallelism_scaling: Recall=1.0 (parallelism 1-16)
   - vsjoin_low_latency: Recall>=0.60

sed -n '168,195p' /root/sageFlow/src/operator/utils/join_strategy_factory.cpp

- Fix JoinConfigValidator to allow the LSH + TWO_TIER combination
- Fix the use_index_ and index_id computation for VSJoin in JoinOperator
- Update integration_test_cases.toml to add VSJoin test cases
- Update task06_integration_test.md to add the integration test framework path task
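The commit message does not say how recall is computed. One plausible reading, assumed here, is the fraction of ground-truth join pairs (for example from a brute-force baseline) that the method under test also produces. `JoinPair` and `recall` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

// A join result identified by (left UID, right UID).
using JoinPair = std::pair<std::uint64_t, std::uint64_t>;

// Fraction of expected pairs that also appear in the produced set.
// An empty ground truth is treated as perfect recall by convention.
double recall(const std::set<JoinPair>& expected,
              const std::set<JoinPair>& produced) {
  if (expected.empty()) return 1.0;
  std::size_t hits = 0;
  for (const JoinPair& p : expected) hits += produced.count(p);
  return static_cast<double>(hits) / static_cast<double>(expected.size());
}
```

Under this definition, a run that finds 3 of 4 ground-truth pairs scores 0.75, which would pass the 0.70 threshold of vsjoin_baseline; extra pairs that are not in the ground truth do not affect the score.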
…gical Partition Routing, and Load Balancing Tests

Task 07: AssignmentTable (RCU) + LoadMonitor
- Implement PartitionAssignment with RCU pattern for lock-free reads
- Implement LoadMonitor for tracking partition load statistics
- Support logical-to-physical partition mapping
- Batch atomic updates for assignment table

Task 08: Logical Partition Routing Integration
- Add VSJoin routing methods in JoinOperator
- Implement routeByLSHBucket() for query routing
- Implement determineTargetPartitions() with multi-partition support
- Integrate with CentroidPartitioner for initial assignment

Task 09: Load Balancing Tests
- Add comprehensive unit tests for LoadMonitor
- Add comprehensive unit tests for PartitionAssignment
- Add integration tests for VSJoin routing
- Add load balancing scenario tests
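The RCU pattern named in Task 07 can be sketched as a copy-on-write table: readers load the current immutable snapshot, while a writer copies it, applies a whole batch of moves, and publishes the copy in one step. The class below is an illustration under assumptions (round-robin initial mapping, hypothetical names), not the actual PartitionAssignment. It uses the `std::atomic_load`/`std::atomic_store` overloads for `shared_ptr`; whether those loads are truly lock-free depends on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

class PartitionAssignmentSketch {
 public:
  using Table = std::vector<std::size_t>;  // logical -> physical partition

  // Assumed initial mapping: round-robin over the physical partitions.
  PartitionAssignmentSketch(std::size_t logical, std::size_t physical) {
    assert(physical > 0);
    auto initial = std::make_shared<Table>(logical);
    for (std::size_t i = 0; i < logical; ++i) (*initial)[i] = i % physical;
    table_ = std::move(initial);
  }

  // Read side: take the current snapshot and look up one entry.
  // Readers never wait for a writer's copy-and-modify step.
  std::size_t physicalOf(std::size_t logical) const {
    std::shared_ptr<const Table> snapshot = std::atomic_load(&table_);
    return snapshot->at(logical);
  }

  // Write side: copy, apply the whole batch, then publish once, so a
  // reader sees either none or all of the moves in the batch.
  void applyBatch(const std::vector<std::pair<std::size_t, std::size_t>>& moves) {
    std::lock_guard<std::mutex> guard(writer_mu_);  // serialize writers only
    auto next = std::make_shared<Table>(*std::atomic_load(&table_));
    for (const auto& [logical, physical] : moves) next->at(logical) = physical;
    std::atomic_store(&table_, std::shared_ptr<const Table>(std::move(next)));
  }

 private:
  std::shared_ptr<const Table> table_;
  std::mutex writer_mu_;
};
```

A load balancer could call `applyBatch` with every move it has decided on for one rebalancing round, which keeps routing decisions consistent within a round.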
* feat(python): add SAGE integration examples and documentation
  - Add test_sageflow_cpp_runtime.py: comprehensive C++ runtime verification tests
  - Add sage_sageflow_dual_stream_join.py: dual-stream Join pipeline demo for RAG
  - Add SAGEFLOW_SAGE_INTEGRATION_GUIDE.md: integration guide for SAGE + SageFlow

  The dual-stream Join demo shows:
  - Query Stream + Document Stream architecture
  - SageFlow C++ engine for vector similarity join
  - RAG context building from join results

* feat(python): complete SAGE integration with Python bindings and examples

  Modified files:
  - sage_flow/__init__.py: update exports for SAGE integration
  - sage_flow/bindings.cpp: enhance Python bindings for dual-stream Join
  - test/CMakeLists.txt: add new test targets

  New files:
  - docs/LLM_INFERENCE_PIPELINE_GUIDE.md: LLM inference pipeline guide
  - docs/VSJOIN_DESIGN_REVIEW_REPORT.md: VSJoin design review
  - examples/python/llm_inference_service_demo.py: LLM service demo
  - examples/python/llm_pipeline_example.py: LLM pipeline example
  - examples/python/sage_integrated_pipeline_demo.py: SAGE integration demo
  - test/IntegrationTest/test_non_join_operators_pipeline.cpp: non-join ops test
  - test/UnitTest/python/: Python unit tests

* chore: bump version to 0.1.3 for PyPI release

  This release includes:
  - SAGE integration Python bindings
  - Dual-stream Join pipeline support
  - Comprehensive C++ runtime verification tests
  - RAG pipeline examples and documentation

* fix(test): correct LSH window_state_type expectation to PARTITIONED

  The inferDefaults() for LSH algorithm returns PARTITIONED, not PARTITIONED_VECTOR. PARTITIONED_VECTOR is only used for VSJOIN algorithm.

* fix(test): remove index_id checks from LSH test

  LSH algorithm doesn't use external index, so left_index_id and right_index_id can be -1. Removed these checks to match feat/implement_vsjoin branch.

* doc: add sage pipeline markdown file
…_size_ms parameter

- Fix BruteForceBaseline to use configured similarity_alpha instead of hardcoded 0.1 in NORMALIZED mode
- Add 7-parameter join() overload with window_size_ms support in pybind interface
- Add StreamingSource for dynamic streaming input (vs batch-oriented SimpleStreamSource)
- Update JoinStrategyConfig to properly propagate similarity_alpha and window_size_ms
…ions

- Remove unused include of join_strategy_config.h in bindings.cpp
- Simplify join binding: use JoinFunction constructor without time_window param
- Remove redundant SimilarityMode::NORMALIZED and alpha settings
- Fix test expectation: LSH uses PARTITIONED instead of PARTITIONED_VECTOR
- Add note about LSH not depending on external index
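Implements the VSJoin dual-layer index architecture (Task 01-04):
New features
Bug fixes
Test verification
Version
Follow-up tasks (TODO)
Task 05: Configuration validation + TOML parsing
Task 06: Integration tests + recall verification
Task 07: AssignmentTable (RCU) + LoadMonitor implementation
Task 08: Logical Partition routing integration
Task 09: Load balancing tests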