feat: Add BM25 lexical search support for Milvus destination #646

Draft
micmarty-deepsense wants to merge 2 commits into main from feature/milvus-lexical-search

@micmarty-deepsense

Summary

  • Add enable_lexical_search configuration flag to Milvus destination connector
  • Add create_bm25_schema() helper method for creating BM25-enabled collection schemas
  • Add unit tests for new configuration and schema generation

Details

This PR adds support for BM25 lexical (full-text) search in Milvus destinations, using the built-in BM25 Function API available in Milvus 2.5+.

Key Changes:

  1. Configuration Flag: Added enable_lexical_search boolean field to MilvusUploadStagerConfig

    • Defaults to False for backward compatibility
    • Requires Milvus 2.5+ and BM25-enabled collection schema
  2. Schema Helper: Added create_bm25_schema() static method to MilvusUploadStager

    • Creates collection schema with BM25 Function that auto-generates sparse vectors
    • Includes proper field configuration: text field with enable_analyzer=True, sparse_vector field, dense embedding field
    • Returns schema and index params ready for MilvusClient.create_collection()
  3. Tests: Added unit tests covering:

    • Test configuration flag default and custom values
    • Test schema structure and field types
    • Test BM25 function configuration
    • Test index parameters
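
For orientation, a schema helper of this kind could be put together with pymilvus roughly as follows. This is a sketch under stated assumptions, not the code in this PR: the function name build_bm25_schema_sketch, the field names (id, text, sparse_vector, embeddings), the function name text_bm25, and the index choices are all illustrative.

```python
def build_bm25_schema_sketch(vector_dim: int, text_max_length: int = 65535):
    """Build (schema, index_params) for a BM25-enabled Milvus 2.5+ collection.

    Hypothetical stand-in for a schema helper; all field names are assumptions.
    """
    if vector_dim <= 0:
        raise ValueError("vector_dim must be a positive integer")

    # Imported lazily so this module can be loaded without pymilvus installed.
    from pymilvus import DataType, Function, FunctionType, MilvusClient

    schema = MilvusClient.create_schema()
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
    # The analyzer must be enabled on the text field that feeds the BM25 function.
    schema.add_field(
        field_name="text",
        datatype=DataType.VARCHAR,
        max_length=text_max_length,
        enable_analyzer=True,
    )
    # Output field for the BM25 function; it is filled by Milvus, not by the client.
    schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
    # Dense embedding field for semantic search.
    schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=vector_dim)

    # The BM25 function maps the text field to the sparse vector field at insert time.
    schema.add_function(
        Function(
            name="text_bm25",
            input_field_names=["text"],
            output_field_names=["sparse_vector"],
            function_type=FunctionType.BM25,
        )
    )

    index_params = MilvusClient.prepare_index_params()
    index_params.add_index(
        field_name="sparse_vector",
        index_type="SPARSE_INVERTED_INDEX",
        metric_type="BM25",
    )
    index_params.add_index(
        field_name="embeddings",
        index_type="AUTOINDEX",
        metric_type="COSINE",
    )
    return schema, index_params
```

The returned pair is meant to be passed to MilvusClient.create_collection() as schema and index_params.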

Usage Pattern:

The collection schema must be created with the helper before ingestion:

from pymilvus import MilvusClient
from unstructured_ingest.processes.connectors.milvus import MilvusUploadStager

client = MilvusClient(uri="http://localhost:19530")
schema, index_params = MilvusUploadStager.create_bm25_schema(
    collection_name="my_docs",
    vector_dim=384
)
client.create_collection(
    collection_name="my_docs",
    schema=schema,
    index_params=index_params
)

# Then ingest with enable_lexical_search=True
# BM25 function auto-generates sparse_vector from text field
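
After ingestion, a lexical query can be issued by passing raw query text to MilvusClient.search() against the sparse field; Milvus applies the BM25 function to the query on the server side. A minimal sketch, assuming the sparse field is named sparse_vector and the text field is named text (both names, and the lexical_search helper itself, are illustrative and not part of this PR):

```python
def lexical_search(client, query_text: str, collection_name: str = "my_docs", limit: int = 5):
    """Run a BM25 full-text query against a BM25-enabled collection.

    `client` is expected to be a pymilvus MilvusClient connected to Milvus 2.5+.
    """
    return client.search(
        collection_name=collection_name,
        data=[query_text],            # raw text, not a vector
        anns_field="sparse_vector",   # assumed name of the BM25 output field
        limit=limit,
        output_fields=["text"],       # assumed name of the source text field
    )
```

For example, lexical_search(client, "invoice payment terms") would return the top five hits ranked by BM25 score.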

Requirements:

  • Milvus 2.5+ (BM25 Function API)
  • Manual schema creation before ingestion
  • pymilvus dependency
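
Because the flag only works against Milvus 2.5+, a caller may want to check the server version before enabling it. A minimal sketch, assuming the version string has already been fetched from the server (for example with pymilvus.utility.get_server_version()) and looks like "v2.5.4" or "2.5.4"; the supports_bm25 helper is hypothetical and not part of this PR:

```python
import re


def supports_bm25(server_version: str) -> bool:
    """Return True if the reported Milvus version is 2.5 or newer."""
    match = re.match(r"v?(\d+)\.(\d+)", server_version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Milvus version string: {server_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles both minor bumps (2.6) and major bumps (3.0).
    return (major, minor) >= (2, 5)
```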

Related:

  • Paired with platform-api PR that adds enable_lexical_search to connector config input model
  • Follows AstraDB pattern for lexical search implementation

Test Plan

  • Unit tests pass for new configuration flag
  • Unit tests pass for schema helper method
  • Schema structure validates correctly (text field with enable_analyzer, sparse_vector field, BM25 function)
  • Index params validate correctly

Add configuration flag and helper method to enable BM25 full-text search
in Milvus 2.5+ destinations using the built-in BM25 Function API.

Changes:
- Add enable_lexical_search flag to MilvusUploadStagerConfig
- Add create_bm25_schema() static helper for creating BM25-enabled collection schemas
- Add unit tests for new configuration and schema generation
- BM25 function auto-generates sparse vectors from text field for lexical search

Requires Milvus 2.5+ and manual schema creation before ingestion using the
provided create_bm25_schema() helper method.

Add enable_lexical_search flag to MilvusUploadStagerConfig to indicate
that the collection is configured for BM25 full-text search with sparse vectors.

Add create_bm25_schema() static helper method that provides example schema
for Milvus 2.5+ BM25 Function API. Users must manually create collection
with this schema before ingestion.

The BM25 Function auto-generates sparse vectors from text content for
keyword-based lexical search.