feat: semantic search with hybrid RRF fusion and content enrich API #1238
chethann007 wants to merge 9 commits
…ding, int8 quantization
Add semantic search support behind a new search_mode dispatch on /v3/search. Defaults to text mode so existing clients are unaffected.
- QueryStrategy interface + TextQueryStrategy (wraps existing logic) + SemanticQueryStrategy (nested kNN on chunks.embedding).
- Pluggable EmbeddingClient (OpenAI/Azure, E5) + factory mirrors the content-embedding-job's service layer so query-time and index-time vectors share a contract.
- Int8 QuantizationStrategy ports the job's L2-norm-detection branch.
- SHA-256 keyed LRU+TTL EmbeddingCache.
- application.conf semantic_search block, disabled by default.
- Docs: DESIGN, API_SPEC, IMPLEMENTATION_PLAN, FLOWCHART under docs/semantic-search/.
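As a rough illustration of what a "SHA-256 keyed LRU+TTL" cache could look like, here is a minimal sketch. It is not the PR's EmbeddingCache: the class name `EmbeddingCacheSketch`, the key composition (model name plus query text), and the explicit `nowMillis` clock parameter are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and key layout are assumptions, not the PR's code.
class EmbeddingCacheSketch {
    private static final class CachedVector {
        final float[] vector;
        final long expiresAtMillis;

        CachedVector(float[] vector, long expiresAtMillis) {
            this.vector = vector;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<String, CachedVector> entries;

    EmbeddingCacheSketch(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts in LRU order once the cap is exceeded.
        this.entries = new LinkedHashMap<String, CachedVector>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedVector> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Hex-encoded SHA-256 of "model\ntext" (the key layout is an assumption).
    static String key(String model, String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((model + "\n" + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns the cached vector, or null on a miss or an expired entry.
    synchronized float[] get(String model, String text, long nowMillis) {
        String k = key(model, text);
        CachedVector hit = entries.get(k);
        if (hit == null) {
            return null;
        }
        if (nowMillis >= hit.expiresAtMillis) {
            entries.remove(k);
            return null;
        }
        return hit.vector;
    }

    synchronized void put(String model, String text, float[] vector, long nowMillis) {
        entries.put(key(model, text), new CachedVector(vector, nowMillis + ttlMillis));
    }
}
```

Passing the clock in explicitly keeps expiry deterministic in tests; a production cache would more likely read `System.currentTimeMillis()` internally.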
- HybridQueryStrategy.build now falls back to text instead of throwing, so processCount, getCollectionsResult, and external processSearchQuery callers don't crash when given a hybrid-mode DTO.
- SemanticQueryStrategy: force fuzzy=false while inheriting filters. prepareFilteredSearchQuery returns a FunctionScoreQueryBuilder, not a BoolQueryBuilder, which silently dropped every filter and left kNN running unconstrained against the whole index.
- HybridSearchExecutor: instantiate a fresh SearchProcessor per leg. The shared processor.relevanceSort flag races between parallel text and semantic sub-searches and can flip sort order non-deterministically.
…Strategy
getAllFieldsPropertyQuery wraps the multi_match in a BoolQueryBuilder, so the previous getName()-based isFullTextLeg detection missed it. The wrapped should-only bool then slipped into the outer filter clause, where its implicit minimum_should_match=1 turned every query term into a hard filter requirement and excluded all matches.
Now we re-run the text query builder against a DTO whose properties have had the propertyName='*' entry stripped, so the inherited bool contains only the property filters. Implicit-filter shoulds are also kept as filters to preserve their constraint behavior.
- Skip the default name/lastUpdatedOn sort when search_mode=semantic so the kNN relevance order is preserved.
- Return results via the with-score path for semantic mode so callers see the cosine score on each hit. Text mode still returns scores only when fuzzy=true (unchanged contract).
- Add POST /content/v3/enrich endpoint in ContentController
- ContentActor routes triggerEnrich to EnrichManager
- EnrichManager validates IDs via DataNode.read, derives objectType from mimeType
- Emits BE_JOB_REQUEST events with action=enrich to publish.job.request topic
- Returns ERR_CONTENT_NOT_FOUND for non-existent identifiers
- Add TRIGGER_ENRICH api ID and update routes in both knowlg-service and content-service



Summary
Introduces semantic search (kNN vector similarity) to the search API, extends it with hybrid text+semantic fusion, and adds an enrich API endpoint for triggering async metadata enrichment without republishing.
Semantic search
A new `search_mode=semantic` runs kNN queries against pre-indexed vector embeddings, ranking results by cosine similarity. Filters from the request are correctly inherited into the kNN filter clause; fixes ship for two silent correctness bugs: full-text `propertyName='*'` leaking into filters (dropping all docs) and fuzzy mode bypassing the `BoolQueryBuilder` filter set. Semantic results always surface the cosine score; the default name/date sort is skipped so kNN ranking is preserved. A `min_score` threshold is supported to drop low-confidence matches.
Hybrid search
`search_mode=hybrid` runs text and semantic queries in parallel and fuses results via Reciprocal Rank Fusion: `score = Σ 1/(k + rank_i)`, with k=60. Each leg runs in a fresh `SearchProcessor` instance to avoid race conditions on mutable sort state. Results merge into a single ranked list with `score_components` (text_rank, semantic_rank) per hit. Pagination is applied post-fusion; facets are borrowed from the text leg. Existing text-mode clients are unaffected.
Enrich API
`POST /content/v3/enrich` accepts a list of content identifiers, validates that each exists, derives objectType from mimeType, and emits `BE_JOB_REQUEST` Kafka events with `action=enrich` to the publish topic. This triggers `EnrichOnlyFunction` in knowlg-publish to re-emit enriched metadata without running the full publish flow. Useful for bulk backfilling existing content into the semantic index after field config or framework changes.
Test plan
- `search_mode=semantic` with filters: verify filters applied correctly, cosine scores present, kNN ordering preserved
- `min_score`: verify hits below threshold excluded
- `POST /content/v3/enrich`: verify Kafka event emitted, EnrichOnlyFunction picks it up
- `search_mode=text` requests: verify no regression
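
For reviewers unfamiliar with Reciprocal Rank Fusion, the formula quoted under hybrid search (each hit scores the sum of 1/(k + rank) over the legs that returned it, with k=60) can be sketched as below. This is an illustrative sketch, not the PR's HybridSearchExecutor: the class name `RrfFusion`, the 1-based rank convention, and the sample identifiers are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of RRF; not the code under review.
class RrfFusion {
    // k = 60 is the constant quoted in the PR description.
    static final int K = 60;

    // Each input list holds document IDs ordered best-first.
    static List<Map.Entry<String, Double>> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                int rank = i + 1; // 1-based rank (an assumption)
                scores.merge(ranked.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> fused = new ArrayList<>(scores.entrySet());
        // Highest fused score first.
        fused.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return fused;
    }

    public static void main(String[] args) {
        List<String> textLeg = List.of("do_1", "do_2", "do_3");
        List<String> semanticLeg = List.of("do_2", "do_4", "do_1");
        for (Map.Entry<String, Double> hit : fuse(List.of(textLeg, semanticLeg), K)) {
            System.out.printf("%s %.5f%n", hit.getKey(), hit.getValue());
        }
    }
}
```

With these sample lists, `do_2` (ranks 2 and 1) edges out `do_1` (ranks 1 and 3), and documents returned by only one leg rank below both. Pagination would then be applied to the fused list, as the description states.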