
feat(vsjoin): implement VSJoin dual-index architecture (v0.1.3.1)#101

Merged
ZeroJustMe merged 18 commits into main-dev from feat/implement_vsjoin on Mar 3, 2026
@ZeroJustMe ZeroJustMe commented Jan 21, 2026

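Implements the VSJoin dual-index architecture (Tasks 01-04):

New features

  • Task 01: Basic VSJoinMethod implementation, including the dual-index query strategy
  • Task 02: JoinStrategyFactory integration: creates 2 Global IVF indexes plus 2×P Local BruteForce indexes
  • Task 03: VSJoin-specific path in JoinOperator, used for index routing
  • Task 04: Background rebuild mechanism that periodically refreshes the Global Index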
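
The dual-index query in Task 01 can be pictured with a small sketch. It assumes each tier returns the UIDs of its candidate matches and that a record may show up in both tiers once the Global Index has been rebuilt, which is why the merge deduplicates. `ToyIndex` and `queryBothTiers` are hypothetical names for illustration, not the project's API.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an index: query() returns the UIDs of candidate matches.
struct ToyIndex {
    std::vector<std::uint64_t> hits;
    std::vector<std::uint64_t> query() const { return hits; }
};

// Query the global tier first, then the local tier, keeping each UID once.
// The unordered_set is local to the call, so no locking is needed for it.
inline std::vector<std::uint64_t> queryBothTiers(const ToyIndex& global,
                                                 const ToyIndex& local) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<std::uint64_t> merged;
    for (const ToyIndex* tier : {&global, &local}) {
        for (std::uint64_t uid : tier->query()) {
            // insert().second is true only the first time a UID is seen.
            if (seen.insert(uid).second) merged.push_back(uid);
        }
    }
    assert(merged.size() == seen.size());  // every emitted UID is unique
    return merged;
}
```

With a global tier holding UIDs 1, 2, 3 and a local tier holding 3, 4, the merged result contains 3 only once.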

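Bug fixes

  • Critical fix: dangling pointer in `globalIndexRebuildLoop`
    • Snapshot ownership was released too early, so IVF centroids were initialized with garbage data (dimension=0)
    • Fix: keep the snapshot vectors alive until `build_index_from_records` completes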

Test verification

  • VSJoin rebuild tests pass (2/2)
  • JoinStrategyFactory tests pass (29/29)
  • Integration tests pass: bruteforce, ivf, hdr_tree, clustered_join

Version

  • Bumped to 0.1.3.1

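Follow-up tasks (TODO)

  • Task 05: Configuration validation + TOML parsing

    • Add VSJoin-related configuration fields to JoinStrategyConfig
    • Implement configuration validation logic and TOML configuration file parsing
  • Task 06: Integration tests + recall verification

    • Verify the VSJoin dual-index query functionality
    • Verify that boundary vectors are not lost under the multicast strategy (recall verification)
    • Test Global Index rebuild and deduplication
  • Task 07: AssignmentTable (RCU) + LoadMonitor implementation

    • VSJoinPartitionAssignment: RCU implementation of the mapping from logical partitions to physical subtasks
    • VSJoinLoadMonitor: samples and aggregates load information from each subtask
  • Task 08: Logical Partition routing integration

    • Extend the LSH partitioner to support logical partitions
    • Integrate AssignmentTable to map logical partitions to physical subtasks
  • Task 09: Load-balancing tests

    • AssignmentTable concurrency-safety tests (lock-free RCU reads, atomic batch updates)
    • Verify load-balancing effectiveness
    • Verify LoadMonitor functional correctness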

This commit introduces a comprehensive plan and task breakdown for the new VSJoin implementation.

- **Refine VSJoin Plan:**
  - The main design document  is updated to unify naming (removing "v2").
  - Clarifies the replacement of v1 components with the new architecture (TwoTierWindowState + ConcurrencyManager).
  - Adds detailed sections on UID deduplication and the RCU-based load balancing mechanism (AssignmentTable).

- **Add Detailed Task Documents:**
  - Creates a new  directory.
  - Adds markdown files for each implementation step (Task 01 to 09), providing clear instructions for development. These files are force-added as the parent directory is in .gitignore.

- **Cleanup Unused v1 Components:**
  - Deletes  and  as they are no longer needed in the new design.

This commit adds a unit test for the VSJoin rebuild process, extending test coverage for the VSJoin implementation. The test file is located at UnitTest/test_vsjoin_rebuild.cpp and runs with a 300-second timeout.

- Task 01: VSJoinMethod basic implementation
  - Implement ExecuteEager() with dual-layer query logic (Global + Local)
  - Add UID deduplication using local unordered_set
  - Implement setGlobalIndexIds/setLocalIndexIds/setWindowStates interfaces

- Task 02: JoinStrategyFactory integration
  - Add JoinAlgorithm::VSJOIN enum
  - Add vsjoin_* configuration parameters
  - Integrate VSJOIN case in factory (fallback to BruteForce for now)

- Task 03: JoinOperator VSJoin special path
  - Add vsjoin_local_*_ids_ and vsjoin_global_*_id_ members
  - Implement VSJoin-specific updateSideWithState() (insert to local index only)
  - Support LSH partitioner for VSJoin

- Task 04: Background rebuild mechanism
  - Implement globalIndexRebuildLoop() with periodic rebuild
  - Use std::call_once for thread-safe single startup
  - Implement local unordered_set deduplication (lock-free)
  - Add atomic index replacement via replace_index_by_id()
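
As a rough illustration of the Task 04 startup pattern, the sketch below shows how `std::call_once` can guarantee that only one background thread is spawned even when many callers request it, and how a condition variable lets shutdown interrupt the periodic wait. `PeriodicRebuilder` and its members are hypothetical; the real `globalIndexRebuildLoop()` is not shown in this PR text, so this is only one way such a loop could be structured.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: runs `rebuild` once per `interval` on a background thread.
class PeriodicRebuilder {
public:
    PeriodicRebuilder(std::chrono::milliseconds interval, std::function<void()> rebuild)
        : interval_(interval), rebuild_(std::move(rebuild)) {
        assert(rebuild_ != nullptr);
    }

    ~PeriodicRebuilder() { stop(); }

    // May be called by every subtask; std::call_once lets only the first call spawn the thread.
    void ensureStarted() {
        std::call_once(start_flag_, [this] { worker_ = std::thread([this] { loop(); }); });
    }

    // Wakes the worker right away instead of waiting out the interval, then joins it.
    // Not meant to race with ensureStarted() on another thread.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

    int completed() const { return completed_.load(); }

private:
    void loop() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stopping_) {
            // wait_for returns the predicate's value: true means stop() was requested.
            if (wake_.wait_for(lock, interval_, [this] { return stopping_; })) break;
            lock.unlock();
            rebuild_();  // run the rebuild without holding the mutex
            ++completed_;
            lock.lock();
        }
    }

    const std::chrono::milliseconds interval_;
    const std::function<void()> rebuild_;
    std::once_flag start_flag_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool stopping_ = false;  // guarded by mutex_
    std::atomic<int> completed_{0};
    std::thread worker_;
};
```

Calling `ensureStarted()` twice still yields a single worker thread, and `stop()` returns as soon as any in-flight rebuild finishes.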

Integration tests passed:
- bruteforce: 7/7 tests, recall=1.000
- ivf: 7/7 tests, recall=0.999
- hdr_tree: 7/7 tests, recall=1.000
- clustered_join: 3/3 tests, recall=1.000
The snapshot ownership was released prematurely in the rebuild loop,
causing the pointers stored in unique_left_records and unique_right_records
to become dangling. This led to IVF centroids being initialized with
garbage data (dimension=0), causing 'Vectors must be of the same size'
errors during query_for_join.

Fix: Keep snapshot vectors alive until build_index_from_records completes
by storing them in left_snapshots/right_snapshots containers.
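
The ownership rule behind this fix can be shown with a simplified sketch. It assumes each snapshot owns its records and that the build step only receives non-owning pointers; `Record`, `Snapshot`, `takeSnapshot`, `buildIndexFromRecords`, and `rebuildOnce` are hypothetical stand-ins for the project's types and functions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Record { std::vector<float> values; };            // hypothetical record type
using Snapshot = std::vector<std::unique_ptr<Record>>;   // a snapshot owns its records

// Stand-in for taking a snapshot of one partition's window state.
inline Snapshot takeSnapshot(std::size_t count, std::size_t dim) {
    Snapshot snap;
    for (std::size_t i = 0; i < count; ++i) {
        auto rec = std::make_unique<Record>();
        rec->values.assign(dim, 1.0f);
        snap.push_back(std::move(rec));
    }
    return snap;
}

// Stand-in for build_index_from_records: reads through the raw pointers only.
inline std::size_t buildIndexFromRecords(const std::vector<const Record*>& records) {
    const std::size_t dim = records.empty() ? 0 : records.front()->values.size();
    for (const Record* r : records) assert(r->values.size() == dim);
    return dim;
}

inline std::size_t rebuildOnce(std::size_t partitions, std::size_t dim) {
    std::vector<Snapshot> snapshots;            // owners: must outlive unique_records
    std::vector<const Record*> unique_records;  // non-owning views into the snapshots
    for (std::size_t p = 0; p < partitions; ++p) {
        snapshots.push_back(takeSnapshot(3, dim));
        // Records live on the heap, so these pointers stay valid even if
        // `snapshots` reallocates and moves the inner vectors.
        for (const auto& rec : snapshots.back()) unique_records.push_back(rec.get());
    }
    // If each snapshot were a loop-local variable instead, it would be destroyed
    // at the end of its iteration and every pointer collected from it would dangle here.
    return buildIndexFromRecords(unique_records);
}
```

The owning container is declared before the pointer list and in the same scope as the build call, so the records cannot be freed while the build is still reading them.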

Also includes:
- Improved test assertions for query record dimension
- Code cleanup for VSJoin factory integration
Copilot AI review requested due to automatic review settings January 21, 2026 14:16

Copilot AI left a comment


Pull request overview

This PR implements the VSJoin dual-index architecture (v0.1.3.1) for real-time LLM context generation, featuring a two-tier index design with Global IVF indexes for fast approximate search and Local BruteForce indexes for partition-specific exact search. The implementation includes a critical bug fix for dangling pointer issues in the background rebuild mechanism.

Changes:

  • Implements VSJoinMethod with dual-index query strategy (Global IVF + Local BruteForce)
  • Adds JoinStrategyFactory integration creating 2 Global + 2×P Local indexes
  • Implements background rebuild mechanism with proper snapshot ownership to fix dangling pointer bug
  • Adds ConcurrencyManager support for atomic index replacement using RCU-style double-write
  • Removes deprecated VSJoin v1 components (distance_verifier, async_candidate_generator)
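
One plausible reading of "atomic index replacement using double-write" is sketched below, as an assumption rather than a description of the actual ConcurrencyManager. While a rebuild is in flight, each insert is written twice: once to the active index and once to a delta buffer. The delta is replayed onto the rebuilt index when it is swapped in, so records that arrive mid-rebuild are not lost. `ToyIndex` and `IndexSlot` are hypothetical, and the code needs C++17 for `std::shared_mutex`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <utility>
#include <vector>

struct ToyIndex { std::vector<int> uids; };  // hypothetical index: just a UID list

class IndexSlot {
public:
    void insert(int uid) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        active_.uids.push_back(uid);
        if (rebuilding_) delta_.push_back(uid);  // second write, replayed at commit
    }

    // Readers share the lock, so concurrent queries do not block each other.
    bool contains(int uid) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return std::find(active_.uids.begin(), active_.uids.end(), uid) != active_.uids.end();
    }

    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return active_.uids.size();
    }

    // Returns a copy to rebuild from and starts recording later inserts.
    ToyIndex beginReplace() {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        rebuilding_ = true;
        delta_.clear();
        return active_;
    }

    // Replays the inserts that arrived during the rebuild, then swaps in one step.
    void commitReplace(ToyIndex rebuilt) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        assert(rebuilding_);
        rebuilt.uids.insert(rebuilt.uids.end(), delta_.begin(), delta_.end());
        active_ = std::move(rebuilt);
        delta_.clear();
        rebuilding_ = false;
    }

private:
    mutable std::shared_mutex mutex_;
    ToyIndex active_;
    std::vector<int> delta_;
    bool rebuilding_ = false;
};
```

The expensive rebuild itself happens between `beginReplace()` and `commitReplace()` without holding the lock; only the snapshot copy and the final swap are exclusive.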

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `test/test_utils/join_config_loader.cpp` | Removes deprecated vsjoin_async_threads and vsjoin_allowed_lateness fields |
| `test/test_utils/integration_test_config.cpp` | Removes deprecated VSJoin v1 field parsing |
| `test/UnitTest/test_vsjoin_rebuild.cpp` | Adds background rebuild thread lifecycle and deduplication tests |
| `test/UnitTest/test_vsjoin_operator_path.cpp` | Adds test for CentroidPartitioner multicast support |
| `test/UnitTest/test_vsjoin_method.cpp` | Adds VSJoinMethod integration tests for dual-index query |
| `test/UnitTest/test_vsjoin_factory.cpp` | Adds factory tests for index creation with parallelism |
| `test/UnitTest/test_join_strategy_factory.cpp` | Removes deprecated index_id checks for LSH strategy |
| `test/UnitTest/test_distance_verifier.cpp` | Disables all tests - component removed in Task01-04 |
| `test/UnitTest/test_async_candidate_generator.cpp` | Disables all tests - component removed in Task01-04 |
| `test/IntegrationTest/test_pipeline_basic.cpp` | Increases lock contention threshold and improves test stability |
| `test/CMakeLists.txt` | Adds test_vsjoin_rebuild to unit test suite |
| `src/operator/utils/join_strategy_factory.cpp` | Implements VSJoin dual-index creation logic |
| `src/operator/utils/join_strategy_config.cpp` | Updates VSJoin config field parsing |
| `src/operator/utils/join_config_validator.cpp` | Updates VSJoin validator for new WindowState types |
| `src/operator/join_operator_methods/vsjoin_method.cpp` | Complete rewrite - dual-index query implementation |
| `src/operator/join_operator_methods/vsjoin_components/*.cpp` | Deletes deprecated v1 components |
| `src/operator/join_operator.cpp` | Adds VSJoin special path + background rebuild loop |
| `src/operator/CMakeLists.txt` | Removes deprecated vsjoin_components sources |
| `src/execution/input_gate.cpp` | Adds stop() to wake blocked consumers |
| `src/execution/execution_vertex.cpp` | Adds stopAndWake() for proper thread termination |
| `src/execution/execution_graph.cpp` | Improves stop/join convergence with stopAndWake() |
| `src/concurrency/concurrency_manager.cpp` | Adds RCU-style index replacement with shared_mutex |
| `src/concurrency/blank_controller.cpp` | Implements double-write and replaceIndex for atomic swap |
| `sage_flow/_version.py` | Bumps version to 0.1.3.1 |
| `pyproject.toml` | Bumps version to 0.1.3.1 |
| `include/operator/utils/join_strategy_factory.h` | Adds dual-index fields to StrategyComponents |
| `include/operator/utils/join_strategy_config.h` | Adds new VSJoin config fields |
| `include/operator/join_operator_methods/vsjoin_method.h` | Complete rewrite - simplified dual-index interface |
| `include/operator/join_operator_methods/vsjoin_components/*.h` | Deletes deprecated v1 component headers |
| `include/operator/join_operator.h` | Adds VSJoin rebuild members and helper methods |
| `include/execution/*.h` | Adds stop() and stopAndWake() methods |
| `include/concurrency/*.h` | Adds RCU support with shared_mutex and double-write |
| `docs/vsjoin_compliant_design_c745d987.plan.md` | Updates design doc with implementation details |
| `docs/tasks/vsjoin/*.md` | Adds 9 task documentation files |

The inferDefaults() for LSH algorithm returns PARTITIONED, not PARTITIONED_VECTOR.
PARTITIONED_VECTOR is only used for VSJOIN algorithm.
@ZeroJustMe ZeroJustMe self-assigned this Jan 21, 2026
sed -n '168,195p' /root/sageFlow/src/operator/utils/join_strategy_factory.cpp

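1. Configuration verification ✅
   - integration_test_cases.toml contains 4 enabled VSJoin test cases
   - The configuration includes the required parameters (vsjoin_num_hash_functions, vsjoin_boundary_threshold, etc.)
   - The num_partitions parameter is set to reasonable values

2. End-to-end path verification ✅
   - JoinStrategyFactory::create() correctly creates VSJoinMethod
   - TwoTierWindowState is initialized correctly
   - Global/Local Indexes are created and managed correctly
   - The background rebuild thread works as expected

3. Test execution verification ✅
   - test_join_baseline_integration --gtest_filter='*vsjoin*' runs successfully
   - run_integration_test.py --methods vsjoin runs successfully
   - Test reports are generated correctly

4. Recall verification ✅
   - vsjoin_baseline: Recall=1.0 (expected >=0.70)
   - vsjoin_high_recall: Recall=1.0 (expected >=0.75)
   - vsjoin_parallelism_scaling: Recall=1.0 (parallelism 1-16)
   - vsjoin_low_latency: Recall>=0.60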
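
For reference, the recall figures above are presumably the fraction of ground-truth join pairs that the tested method also reported. The sketch below computes recall under that assumption; the `Pair` alias and `recall` function are hypothetical, and the project's test harness may define the metric differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

using Pair = std::pair<std::uint64_t, std::uint64_t>;  // (left uid, right uid)

// recall = |expected ∩ found| / |expected|
inline double recall(const std::set<Pair>& expected, const std::set<Pair>& found) {
    if (expected.empty()) return 1.0;  // convention: nothing could be missed
    std::size_t hits = 0;
    for (const Pair& p : expected) hits += found.count(p);
    assert(hits <= expected.size());
    return static_cast<double>(hits) / static_cast<double>(expected.size());
}
```

Extra pairs in `found` that are not in `expected` do not affect recall; they would lower precision instead.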

sed -n '168,195p' /root/sageFlow/src/operator/utils/join_strategy_factory.cpp
- Fix JoinConfigValidator to allow the LSH + TWO_TIER combination
- Fix the use_index_ and index_id computation for VSJoin in JoinOperator
- Update integration_test_cases.toml with VSJoin test cases
- Update task06_integration_test.md with tasks for the integration test framework path
…gical Partition Routing, and Load Balancing Tests

Task 07: AssignmentTable (RCU) + LoadMonitor
- Implement PartitionAssignment with RCU pattern for lock-free reads
- Implement LoadMonitor for tracking partition load statistics
- Support logical-to-physical partition mapping
- Batch atomic updates for assignment table
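
The RCU idea in Task 07 can be sketched as copy-on-write over an immutable table: readers grab the current snapshot, writers copy it, apply a whole batch, and publish the new version in one pointer store. This is a minimal sketch with a hypothetical `AssignmentTable` class, not the project's PartitionAssignment. It uses the `std::atomic_load`/`std::atomic_store` overloads for `std::shared_ptr`, which are deprecated since C++20 in favour of `std::atomic<std::shared_ptr<T>>`; whether the pointer load is truly lock-free depends on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

class AssignmentTable {
public:
    using Table = std::vector<std::size_t>;  // index: logical partition, value: physical subtask

    AssignmentTable(std::size_t logical, std::size_t physical) {
        assert(physical > 0);
        auto initial = std::make_shared<Table>(logical);
        for (std::size_t i = 0; i < logical; ++i) (*initial)[i] = i % physical;
        table_ = std::move(initial);  // no readers exist yet, plain assignment is fine
    }

    // Readers take a snapshot and never wait for a writer's copy-and-modify step.
    std::size_t subtaskFor(std::size_t logical) const {
        std::shared_ptr<const Table> snapshot = std::atomic_load(&table_);
        return snapshot->at(logical);
    }

    // All moves in the batch become visible together when the new table is published.
    void applyBatch(const std::vector<std::pair<std::size_t, std::size_t>>& moves) {
        std::lock_guard<std::mutex> lock(writer_mutex_);  // serialize writers only
        auto next = std::make_shared<Table>(*std::atomic_load(&table_));
        for (const auto& move : moves) next->at(move.first) = move.second;
        std::atomic_store(&table_, std::shared_ptr<const Table>(std::move(next)));
    }

private:
    std::shared_ptr<const Table> table_;
    std::mutex writer_mutex_;
};
```

A reader that loaded the old snapshot keeps a consistent view until it drops its `shared_ptr`; the old table is freed once the last such reader is done.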

Task 08: Logical Partition Routing Integration
- Add VSJoin routing methods in JoinOperator
- Implement routeByLSHBucket() for query routing
- Implement determineTargetPartitions() with multi-partition support
- Integrate with CentroidPartitioner for initial assignment
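
To make multi-partition routing concrete, here is a sketch based on random-hyperplane LSH, which is an assumption: the PR text does not say which hash family `routeByLSHBucket()` uses. A vector gets one bit per hyperplane, and when it lies within a threshold of a hyperplane it is also sent to the partition of the bucket on the other side, so near-boundary matches are not lost. `lshBucket` and `targetPartitions` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using Vec = std::vector<float>;

inline float dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// One bit per hyperplane: bit i is set when v lies on the non-negative side of planes[i].
inline std::uint32_t lshBucket(const Vec& v, const std::vector<Vec>& planes) {
    assert(planes.size() <= 32);
    std::uint32_t bucket = 0;
    for (std::size_t i = 0; i < planes.size(); ++i) {
        if (dot(v, planes[i]) >= 0.0f) bucket |= (1u << i);
    }
    return bucket;
}

// Home partition plus, for each hyperplane the vector is within `threshold` of,
// the partition of the bucket with that bit flipped (multicast for boundary vectors).
inline std::set<std::size_t> targetPartitions(const Vec& v, const std::vector<Vec>& planes,
                                              float threshold, std::size_t num_partitions) {
    assert(num_partitions > 0);
    const std::uint32_t home = lshBucket(v, planes);
    std::set<std::size_t> targets{home % num_partitions};
    for (std::size_t i = 0; i < planes.size(); ++i) {
        if (std::fabs(dot(v, planes[i])) < threshold) {
            targets.insert((home ^ (1u << i)) % num_partitions);
        }
    }
    return targets;
}
```

A larger threshold sends more vectors to several partitions, trading extra work for recall near bucket boundaries.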

Task 09: Load Balancing Tests
- Add comprehensive unit tests for LoadMonitor
- Add comprehensive unit tests for PartitionAssignment
- Add integration tests for VSJoin routing
- Add load balancing scenario tests
* feat(python): add SAGE integration examples and documentation

- Add test_sageflow_cpp_runtime.py: comprehensive C++ runtime verification tests
- Add sage_sageflow_dual_stream_join.py: dual-stream Join pipeline demo for RAG
- Add SAGEFLOW_SAGE_INTEGRATION_GUIDE.md: integration guide for SAGE + SageFlow

The dual-stream Join demo shows:
- Query Stream + Document Stream architecture
- SageFlow C++ engine for vector similarity join
- RAG context building from join results

* feat(python): complete SAGE integration with Python bindings and examples

Modified files:
- sage_flow/__init__.py: update exports for SAGE integration
- sage_flow/bindings.cpp: enhance Python bindings for dual-stream Join
- test/CMakeLists.txt: add new test targets

New files:
- docs/LLM_INFERENCE_PIPELINE_GUIDE.md: LLM inference pipeline guide
- docs/VSJOIN_DESIGN_REVIEW_REPORT.md: VSJoin design review
- examples/python/llm_inference_service_demo.py: LLM service demo
- examples/python/llm_pipeline_example.py: LLM pipeline example
- examples/python/sage_integrated_pipeline_demo.py: SAGE integration demo
- test/IntegrationTest/test_non_join_operators_pipeline.cpp: non-join ops test
- test/UnitTest/python/: Python unit tests

* chore: bump version to 0.1.3 for PyPI release

This release includes:
- SAGE integration Python bindings
- Dual-stream Join pipeline support
- Comprehensive C++ runtime verification tests
- RAG pipeline examples and documentation

* fix(test): correct LSH window_state_type expectation to PARTITIONED

The inferDefaults() for LSH algorithm returns PARTITIONED, not PARTITIONED_VECTOR.
PARTITIONED_VECTOR is only used for VSJOIN algorithm.

* fix(test): remove index_id checks from LSH test

LSH algorithm doesn't use external index, so left_index_id and right_index_id
can be -1. Removed these checks to match feat/implement_vsjoin branch.

* doc: add sage pipeline markdown file
…_size_ms parameter

- Fix BruteForceBaseline to use configured similarity_alpha instead of hardcoded 0.1 in NORMALIZED mode
- Add 7-parameter join() overload with window_size_ms support in pybind interface
- Add StreamingSource for dynamic streaming input (vs batch-oriented SimpleStreamSource)
- Update JoinStrategyConfig to properly propagate similarity_alpha and window_size_ms
…ions

- Remove unused include of join_strategy_config.h in bindings.cpp
- Simplify join binding: use JoinFunction constructor without time_window param
- Remove redundant SimilarityMode::NORMALIZED and alpha settings
- Fix test expectation: LSH uses PARTITIONED instead of PARTITIONED_VECTOR
- Add note about LSH not depending on external index
@ZeroJustMe ZeroJustMe marked this pull request as ready for review March 3, 2026 03:06
@ZeroJustMe ZeroJustMe merged commit 613ad56 into main-dev Mar 3, 2026
3 checks passed