
feat: semantic search with hybrid RRF fusion and content enrich API #1238

Open
chethann007 wants to merge 9 commits into develop from feat/semantic-search

@chethann007
Collaborator

Summary

Introduces semantic search (kNN vector similarity) to the search API, extends it with hybrid text+semantic fusion, and adds an enrich API endpoint for triggering async metadata enrichment without republishing.

Semantic search

A new search_mode=semantic runs kNN queries against pre-indexed vector embeddings and ranks results by cosine similarity. Filters from the request are now inherited into the kNN filter clause. This fixes two silent correctness bugs: the full-text propertyName='*' entry leaking into the filters (which excluded every document), and fuzzy mode bypassing the BoolQueryBuilder filter set. Semantic results always include the cosine score, and the default name/date sort is skipped so the kNN ranking is preserved. A min_score threshold is supported to drop low-confidence matches.
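The ranking and thresholding described above can be illustrated with a small brute-force sketch. This is not the PR's implementation: the real query runs as a kNN search inside the search engine, whose reported score may be scaled differently from raw cosine. The document shape (`identifier`, `embedding`, plain metadata fields) and the function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, docs, filters=None, k=10, min_score=None):
    """Brute-force stand-in for a filtered kNN query.

    Assumption: each doc is a dict with 'identifier', 'embedding', and
    plain metadata fields that `filters` is matched against exactly.
    """
    filters = filters or {}
    hits = []
    for doc in docs:
        # Filters constrain the candidate set before similarity is considered.
        if any(doc.get(field) != value for field, value in filters.items()):
            continue
        score = cosine(query_vec, doc["embedding"])
        # min_score drops low-confidence matches.
        if min_score is not None and score < min_score:
            continue
        hits.append({"identifier": doc["identifier"], "score": score})
    # No name/date sort here: order is by similarity only.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]
```

Calling `semantic_search(vec, docs, filters={"status": "Live"}, min_score=0.5)` returns only documents that pass the filter and score at least 0.5, ordered by score.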

Hybrid search

search_mode=hybrid runs the text and semantic queries in parallel and fuses their results via Reciprocal Rank Fusion: score = Σ 1/(k + rank_i), with k=60. Each leg runs in a fresh SearchProcessor instance to avoid race conditions on mutable sort state. Results merge into a single ranked list with score_components (text_rank, semantic_rank) per hit. Pagination is applied post-fusion, and facets are taken from the text leg. Existing text-mode clients are unaffected.
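As a rough illustration of the fusion step, the sketch below applies score = Σ 1/(k + rank_i) with k=60 to two ranked lists of identifiers, attaches score_components, and paginates after fusion. It assumes 1-based ranks and breaks ties by identifier; neither detail is stated in the PR, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

RRF_K = 60  # constant from the PR description


def rrf_fuse(text_ids, semantic_ids, k=RRF_K):
    """Fuse two ranked identifier lists with Reciprocal Rank Fusion.

    Assumption: ranks are 1-based; a hit missing from one leg simply
    contributes nothing for that leg.
    """
    text_rank = {doc_id: r for r, doc_id in enumerate(text_ids, start=1)}
    semantic_rank = {doc_id: r for r, doc_id in enumerate(semantic_ids, start=1)}
    fused = []
    for doc_id in text_rank.keys() | semantic_rank.keys():
        t = text_rank.get(doc_id)
        s = semantic_rank.get(doc_id)
        score = sum(1.0 / (k + r) for r in (t, s) if r is not None)
        fused.append({
            "identifier": doc_id,
            "score": score,
            "score_components": {"text_rank": t, "semantic_rank": s},
        })
    # Highest fused score first; identifier as a deterministic tie-break (assumption).
    fused.sort(key=lambda h: (-h["score"], h["identifier"]))
    return fused


def hybrid_search(run_text, run_semantic, offset=0, limit=10):
    """Run both legs in parallel, fuse, then paginate the fused list."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(run_text)
        semantic_future = pool.submit(run_semantic)
        fused = rrf_fuse(text_future.result(), semantic_future.result())
    return fused[offset:offset + limit]
```

Paginating only after fusion matters: slicing each leg first would discard documents that rank modestly in both legs yet rank highly once their contributions are summed.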

Enrich API

POST /content/v3/enrich accepts a list of content identifiers, validates that each exists, derives objectType from mimeType, and emits BE_JOB_REQUEST Kafka events with action=enrich to the publish topic. These events trigger EnrichOnlyFunction in knowlg-publish, which re-emits enriched metadata without running the full publish flow. This is useful for bulk backfilling existing content into the semantic index after field config or framework changes.
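The validate-then-emit flow might look roughly like the sketch below. Everything beyond the names in the description is an assumption: the event field layout, the rule that only the collection mimeType maps to Collection, and the choice to validate every identifier before building any event. No Kafka client is involved; the function only returns the events a producer would send.

```python
import time

# Assumption: only this mimeType maps to "Collection"; everything else is "Content".
COLLECTION_MIME_TYPE = "application/vnd.ekstep.content-collection"


class ContentNotFound(Exception):
    """Raised when an identifier cannot be read (error code from the PR)."""
    code = "ERR_CONTENT_NOT_FOUND"


def derive_object_type(mime_type):
    return "Collection" if mime_type == COLLECTION_MIME_TYPE else "Content"


def build_enrich_events(identifiers, read_node):
    """Validate every identifier, then build one enrich event per node.

    `read_node` is a hypothetical lookup returning node metadata or None.
    The event layout below is illustrative, not the actual schema.
    """
    nodes = []
    for identifier in identifiers:
        node = read_node(identifier)
        if node is None:
            raise ContentNotFound(identifier)
        nodes.append((identifier, node))
    now_ms = int(time.time() * 1000)
    return [
        {
            "eid": "BE_JOB_REQUEST",
            "ets": now_ms,
            "object": {"id": identifier, "type": derive_object_type(node["mimeType"])},
            "edata": {"action": "enrich"},
        }
        for identifier, node in nodes
    ]
```

Validating all identifiers up front means a bad identifier fails the whole request before any event is produced, which avoids partially enqueued batches.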

Test plan

  • search_mode=semantic with filters — verify filters applied correctly, cosine scores present, kNN ordering preserved
  • min_score — verify hits below threshold excluded
  • POST /content/v3/enrich — verify Kafka event emitted, EnrichOnlyFunction picks it up
  • Existing search_mode=text requests — verify no regression

…ding, int8 quantization

Add semantic search support behind a new search_mode dispatch on /v3/search.
Defaults to text mode so existing clients are unaffected.

- QueryStrategy interface + TextQueryStrategy (wraps existing logic) +
  SemanticQueryStrategy (nested kNN on chunks.embedding).
- Pluggable EmbeddingClient (OpenAI/Azure, E5) + factory mirrors the
  content-embedding-job's service layer so query-time and index-time
  vectors share a contract.
- Int8 QuantizationStrategy ports the job's L2-norm-detection branch.
- SHA-256 keyed LRU+TTL EmbeddingCache.
- application.conf semantic_search block, disabled by default.
- Docs: DESIGN, API_SPEC, IMPLEMENTATION_PLAN, FLOWCHART under docs/semantic-search/.
- HybridQueryStrategy.build now falls back to text instead of throwing,
  so processCount, getCollectionsResult, and external processSearchQuery
  callers don't crash when given a hybrid-mode DTO.
- SemanticQueryStrategy: force fuzzy=false while inheriting filters.
  With fuzzy=true, prepareFilteredSearchQuery returns a
  FunctionScoreQueryBuilder rather than a BoolQueryBuilder, so every
  filter was silently dropped and kNN ran unconstrained against the
  whole index.
- HybridSearchExecutor: instantiate a fresh SearchProcessor per leg.
  The shared processor.relevanceSort flag races between parallel text
  and semantic sub-searches and can flip sort order non-deterministically.
…Strategy

getAllFieldsPropertyQuery wraps the multi_match in a BoolQueryBuilder,
so the previous getName()-based isFullTextLeg detection missed it. The
wrapped should-only bool then slipped into the outer filter clause where
its implicit minimum_should_match=1 turned every query term into a hard
filter requirement and excluded all matches.

Now we re-run the text query builder against a DTO whose properties have
had the propertyName='*' entry stripped, so the inherited bool contains
only the property filters. Implicit-filter shoulds are also kept as
filters to preserve their constraint behavior.
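A minimal sketch of the stripping step described above, assuming each property is a dict with `propertyName` and `values` keys (the real DTO is a typed object, and the clause shape returned here is purely illustrative):

```python
FULL_TEXT_PROPERTY = "*"


def strip_full_text(properties):
    """Return the properties without the full-text propertyName='*' entry."""
    return [p for p in properties if p.get("propertyName") != FULL_TEXT_PROPERTY]


def inherited_filter_clauses(properties):
    """Build filter clauses from property filters only.

    Assumption: the clause shape is a plain dict for illustration; the PR
    instead re-runs the text query builder on the stripped DTO.
    """
    return [
        {"field": p["propertyName"], "values": list(p["values"])}
        for p in strip_full_text(properties)
    ]
```

With the '*' entry removed before the filter is built, the free-text terms can no longer become hard filter requirements on the kNN query.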
- Skip the default name/lastUpdatedOn sort when search_mode=semantic so
  the kNN relevance order is preserved.
- Return results via the with-score path for semantic mode so callers
  see the cosine score on each hit. Text mode still returns scores only
  when fuzzy=true (unchanged contract).
- Add POST /content/v3/enrich endpoint in ContentController
- ContentActor routes triggerEnrich to EnrichManager
- EnrichManager validates IDs via DataNode.read, derives objectType from mimeType
- Emits BE_JOB_REQUEST events with action=enrich to publish.job.request topic
- Returns ERR_CONTENT_NOT_FOUND for non-existent identifiers
- Add TRIGGER_ENRICH api ID and update routes in both knowlg-service and content-service
@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.
