Adds comprehensive semantic search capabilities to the Logseq knowledge base application using local embeddings (fastembed-rs) and vector storage (Qdrant). This implementation follows the simplified DDD architecture established in the project.

## Domain Layer Extensions

- **Value Objects**:
  - `ChunkId`: Unique identifier for text chunks (format: block-id-chunk-index)
  - `EmbeddingVector`: 384-dimensional vector with cosine similarity computation
  - `SimilarityScore`: Normalized similarity score (0.0-1.0)
  - `EmbeddingModel`: Enum for supported models (currently all-MiniLM-L6-v2)
- **Entities**:
  - `TextChunk`: Preprocessed text with metadata (block/page context, hierarchy path, embeddings)

## Infrastructure Layer

- **FastEmbed Service** (`fastembed_service.rs`):
  - Local embedding generation using fastembed 5.2
  - all-MiniLM-L6-v2 model (384 dimensions, ~90MB)
  - Batch processing for efficiency
  - Async/await for non-blocking operations
- **Qdrant Vector Store** (`qdrant_store.rs`):
  - Qdrant client integration (requires Docker instance)
  - Collection management with cosine distance metric
  - Metadata storage (page/block IDs, hierarchy, content)
  - Point-based CRUD operations
  - Filter support for page-scoped searches
- **Text Preprocessor** (`text_preprocessor.rs`):
  - Logseq syntax cleaning (removes TODO/DONE markers)
  - Page reference conversion ([[page]] → page)
  - Tag normalization (#tag → tag)
  - Context addition (page title + hierarchy path)
  - Smart chunking with word overlap (configurable)

## Application Layer

- **Embedding Service** (`embedding_service.rs`):
  - Orchestrates preprocessing → embedding → storage pipeline
  - Configurable batch processing (default: 32 chunks)
  - Page-level and bulk operations
  - Statistics tracking (blocks processed, chunks created/stored, errors)
  - Delete operations (by page or block)
  - Vector store statistics
- **Search Integration** (`search.rs`):
  - Extended `SearchPagesAndBlocks` with semantic search support
  - New `SearchType` enum (Traditional, Semantic)
  - Async `execute()` method (breaking change)
  - Combined semantic + traditional search results
  - Maintains existing filtering (pages, result types)

## Testing

- Updated all integration tests to async/await
- Added comprehensive semantic search integration tests:
  - Semantic similarity validation
  - Page filtering
  - Chunking for long content
  - Hierarchical context preservation
  - Semantic vs traditional search comparison
  - Embedding statistics and collection management
  - Delete operations

## Dependencies

- `fastembed = "5.2"`: Local embedding generation
- `qdrant-client = "1.11"`: Vector database client
- `regex = "1.10"`: Text preprocessing
- `chrono = "0.4"`: Timestamp handling

## Configuration

Default configuration (EmbeddingServiceConfig):

- Model: all-MiniLM-L6-v2 (384 dimensions)
- Qdrant URL: http://localhost:6334
- Collection: logseq_blocks
- Max words per chunk: 150 (~512 tokens with margin)
- Overlap words: 50
- Batch size: 32

## Testing

All tests pass (169 total: 162 passed, 7 ignored):

- 7 semantic search tests require running Qdrant instance (ignored by default)
- Use `cargo test -- --ignored` to run with Qdrant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Implements semantic search capabilities for the Logseq knowledge base using local embeddings (fastembed-rs) and vector storage (Qdrant). This follows the simplified DDD architecture established in the project.
Key Features
Domain Layer Extensions
Value Objects
- `ChunkId`: Unique identifier for text chunks
- `EmbeddingVector`: 384-dimensional vectors with cosine similarity computation
- `SimilarityScore`: Normalized similarity scores (0.0-1.0)
- `EmbeddingModel`: Supported embedding models enum
Entities
- `TextChunk`: Preprocessed text with full metadata (block/page context, hierarchy, embeddings)
Infrastructure Layer
FastEmbed Service
Qdrant Vector Store
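The collection is configured with the cosine distance metric, the same measure the `EmbeddingVector` value object computes. As a rough illustration only, the sketch below shows how such a value object might compute cosine similarity in plain Rust. The struct layout, the constructor, and the clamping used for `SimilarityScore` are assumptions for this example and are not taken from the PR's code.

```rust
/// Hypothetical stand-in for the `EmbeddingVector` value object.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingVector(Vec<f32>);

impl EmbeddingVector {
    /// all-MiniLM-L6-v2 produces 384-dimensional vectors.
    pub const DIMENSIONS: usize = 384;

    /// Reject vectors that do not have exactly 384 components.
    pub fn new(values: Vec<f32>) -> Result<Self, String> {
        if values.len() != Self::DIMENSIONS {
            return Err(format!(
                "expected {} dimensions, got {}",
                Self::DIMENSIONS,
                values.len()
            ));
        }
        Ok(Self(values))
    }

    /// Cosine similarity in [-1.0, 1.0]; returns 0.0 if either norm is zero.
    pub fn cosine_similarity(&self, other: &Self) -> f32 {
        let dot: f32 = self.0.iter().zip(&other.0).map(|(a, b)| a * b).sum();
        let norm_a = self.0.iter().map(|a| a * a).sum::<f32>().sqrt();
        let norm_b = other.0.iter().map(|b| b * b).sum::<f32>().sqrt();
        if norm_a == 0.0 || norm_b == 0.0 {
            return 0.0;
        }
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical `SimilarityScore`, kept within 0.0..=1.0.
/// Assumption: negative cosine values are clamped to 0.0.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SimilarityScore(f32);

impl SimilarityScore {
    pub fn from_cosine(cosine: f32) -> Self {
        Self(cosine.clamp(0.0, 1.0))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}
```

Validating the dimension count in the constructor means a mismatched vector is rejected before it reaches the 384-dimensional collection.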
Text Preprocessor
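The preprocessor removes TODO/DONE markers, unwraps `[[page]]` references, normalizes `#tag` to `tag`, and splits long text into word chunks that overlap. The following is a simplified sketch of those steps using only the standard library. The PR depends on the `regex` crate, so the actual implementation probably differs. The function names `clean_logseq_text` and `chunk_words` are hypothetical.

```rust
/// Strip a leading TODO/DONE marker, unwrap [[page]] references,
/// and drop the leading '#' from tags. Simplified: only a marker at the
/// start of the block is handled, and whitespace is collapsed.
pub fn clean_logseq_text(raw: &str) -> String {
    let trimmed = raw.trim();
    let without_marker = trimmed
        .strip_prefix("TODO ")
        .or_else(|| trimmed.strip_prefix("DONE "))
        .unwrap_or(trimmed);
    let unwrapped = without_marker.replace("[[", "").replace("]]", "");
    unwrapped
        .split_whitespace()
        .map(|word| match word.strip_prefix('#') {
            Some(tag) if !tag.is_empty() => tag,
            _ => word,
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Split text into chunks of at most `max_words` words, where each chunk
/// repeats the last `overlap` words of the previous one.
pub fn chunk_words(text: &str, max_words: usize, overlap: usize) -> Vec<String> {
    assert!(
        max_words > 0 && overlap < max_words,
        "overlap must be smaller than max_words"
    );
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = max_words - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + max_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the final chunk already reaches the end of the text
        }
        start += step;
    }
    chunks
}
```

With the defaults named in this PR (150 words per chunk, 50 words of overlap), consecutive chunks start 100 words apart, so a 400-word block yields four chunks.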
Application Layer
Embedding Service
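The service sends chunks through the pipeline in configurable batches (default 32) and tracks statistics along the way. The sketch below illustrates that idea with a synchronous stand-in. The real service is async and talks to fastembed and Qdrant, which are replaced here by a caller-supplied closure. The default values come from this PR's configuration, but the field names, `EmbeddingStats`, and `process_in_batches` are assumptions made for the example.

```rust
/// Hypothetical shape of `EmbeddingServiceConfig`; field names are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingServiceConfig {
    pub qdrant_url: String,
    pub collection: String,
    pub max_words_per_chunk: usize,
    pub overlap_words: usize,
    pub batch_size: usize,
}

impl Default for EmbeddingServiceConfig {
    fn default() -> Self {
        // Default values as listed in the PR description.
        Self {
            qdrant_url: "http://localhost:6334".to_string(),
            collection: "logseq_blocks".to_string(),
            max_words_per_chunk: 150,
            overlap_words: 50,
            batch_size: 32,
        }
    }
}

/// Hypothetical statistics record for one embedding run.
#[derive(Debug, Default, PartialEq)]
pub struct EmbeddingStats {
    pub chunks_created: usize,
    pub chunks_stored: usize,
    pub errors: usize,
}

/// Feed chunks to `embed_and_store` one batch at a time. The closure stands
/// in for the embedding + storage steps and returns how many chunks it stored.
pub fn process_in_batches<F>(
    chunks: &[String],
    config: &EmbeddingServiceConfig,
    mut embed_and_store: F,
) -> EmbeddingStats
where
    F: FnMut(&[String]) -> Result<usize, String>,
{
    let mut stats = EmbeddingStats {
        chunks_created: chunks.len(),
        ..Default::default()
    };
    // `.max(1)` guards against a zero batch size, which `chunks` would reject.
    for batch in chunks.chunks(config.batch_size.max(1)) {
        match embed_and_store(batch) {
            Ok(stored) => stats.chunks_stored += stored,
            Err(_) => stats.errors += 1, // count the failed batch and continue
        }
    }
    stats
}
```

Counting a failed batch instead of aborting lets a bulk run finish and report its errors at the end. Whether the actual service continues or stops on failure is not stated in the PR.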
Search Integration
- Extended `SearchPagesAndBlocks` with semantic search
- New `SearchType` enum (Traditional, Semantic)
- Async `execute()` method
Testing
- `#[ignore]` tests require a running Qdrant instance
Run with Qdrant:
`cargo test -- --ignored`
Dependencies Added
- `fastembed = "5.2"`: Local embedding generation
- `qdrant-client = "1.11"`: Vector database client
- `regex = "1.10"`: Text preprocessing
- `chrono = "0.4"`: Timestamp handling
Configuration
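Default `EmbeddingServiceConfig`:
- Model: all-MiniLM-L6-v2 (384 dimensions)
- Qdrant URL: http://localhost:6334
- Collection: logseq_blocks
- Max words per chunk: 150
- Overlap words: 50
- Batch size: 32
Setup Requirements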
Requires Qdrant running locally (Docker recommended):
Files Changed
Breaking Changes
- `SearchPagesAndBlocks::execute()` is now async (requires `.await`)
Test Plan
🤖 Generated with Claude Code