feat: semantic search with hybrid RRF fusion and content enrich API by chethann007 · Pull Request #1238 · Sunbird-Knowlg/knowledge-platform

chethann007 · 2026-05-27T00:08:31Z

Summary

Introduces semantic search (kNN vector similarity) to the search API, extends it with hybrid text+semantic fusion, and adds an enrich API endpoint for triggering async metadata enrichment without republishing.

Semantic search

A new search_mode=semantic runs kNN queries against pre-indexed vector embeddings and ranks results by cosine similarity. Request filters are now inherited into the kNN filter clause. This includes fixes for two silent correctness bugs: the full-text propertyName='*' entry leaking into the filters (which excluded every document), and fuzzy mode bypassing the BoolQueryBuilder filter set. Semantic results always include the cosine score, and the default name/date sort is skipped so that the kNN ranking is preserved. A min_score threshold is supported to drop low-confidence matches.
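The behaviour described above can be illustrated with a minimal in-memory sketch. It is not the PR's implementation, which runs kNN inside the search index. The function names, the `keep` predicate, and the document shape are assumptions made for illustration only.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, docs, keep, k=10, min_score=None):
    """Rank docs by cosine similarity; `keep` is a hypothetical filter predicate."""
    hits = []
    for doc in docs:
        if not keep(doc):
            continue  # request filters constrain the kNN candidates
        score = cosine(query_vec, doc["embedding"])
        if min_score is not None and score < min_score:
            continue  # drop low-confidence matches
        hits.append({"identifier": doc["identifier"], "score": score})
    # No name/date sort is applied: ordering is by similarity only.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]

docs = [
    {"identifier": "do_1", "status": "Live", "embedding": [1.0, 0.0]},
    {"identifier": "do_2", "status": "Draft", "embedding": [1.0, 0.0]},
    {"identifier": "do_3", "status": "Live", "embedding": [0.0, 1.0]},
    {"identifier": "do_4", "status": "Live", "embedding": [1.0, 1.0]},
]
results = semantic_search([1.0, 0.0], docs, lambda d: d["status"] == "Live", min_score=0.5)
```

In this toy data set, `do_2` is excluded by the filter even though its vector matches exactly, and `do_3` is excluded by `min_score`.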

Hybrid search

search_mode=hybrid runs the text and semantic queries in parallel and fuses their results via Reciprocal Rank Fusion: score = Σ 1/(k + rank_i), with k=60. Each leg runs in a fresh SearchProcessor instance to avoid race conditions on mutable sort state. The results merge into a single ranked list, with score_components (text_rank, semantic_rank) reported per hit. Pagination is applied after fusion, and facets are taken from the text leg. Existing text-mode clients are unaffected.
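A minimal sketch of the fusion step, assuming ranks start at 1 and each leg returns an ordered list of unique identifiers. The function name `rrf_fuse` and the `offset`/`limit` parameters are hypothetical; only the formula, k=60, and the score_components field names come from the description above.

```python
def rrf_fuse(text_ids, semantic_ids, k=60, offset=0, limit=10):
    """Fuse two ranked identifier lists with Reciprocal Rank Fusion."""
    ranks = {}
    for leg, ids in (("text_rank", text_ids), ("semantic_rank", semantic_ids)):
        for rank, doc_id in enumerate(ids, start=1):  # assumption: 1-based ranks
            ranks.setdefault(doc_id, {})[leg] = rank

    fused = []
    for doc_id, components in ranks.items():
        # score = sum over the legs that returned the doc of 1 / (k + rank)
        score = sum(1.0 / (k + rank) for rank in components.values())
        fused.append({
            "identifier": doc_id,
            "score": score,
            "score_components": {
                "text_rank": components.get("text_rank"),
                "semantic_rank": components.get("semantic_rank"),
            },
        })

    fused.sort(key=lambda hit: hit["score"], reverse=True)
    return fused[offset:offset + limit]  # pagination happens after fusion

page = rrf_fuse(["a", "b", "c"], ["b", "d"])
```

Here `b` appears in both legs, so its score 1/62 + 1/61 outranks `a` (1/61), which appears only at the top of the text leg.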

Enrich API

POST /content/v3/enrich accepts a list of content identifiers, validates each exists, derives objectType from mimeType, and emits BE_JOB_REQUEST Kafka events with action=enrich to the publish topic — triggering EnrichOnlyFunction in knowlg-publish to re-emit enriched metadata without running the full publish flow. Useful for bulk backfilling existing content into the semantic index after field config or framework changes.
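The flow above might look roughly like the following sketch. Everything beyond the BE_JOB_REQUEST event name, action=enrich, and the ERR_CONTENT_NOT_FOUND code is an assumption: the event field layout, the mimeType-to-objectType rule, the `lookup` callable, and the choice to reject the whole request when any identifier is missing are all hypothetical.

```python
# Assumed collection mimeType; the PR does not list the actual mapping.
COLLECTION_MIME = "application/vnd.ekstep.content-collection"

def object_type_for(mime_type: str) -> str:
    """Hypothetical rule: collections map to 'Collection', everything else to 'Content'."""
    return "Collection" if mime_type == COLLECTION_MIME else "Content"

def build_enrich_events(identifiers, lookup):
    """Validate identifiers, then build one enrich event per identifier.

    `lookup` stands in for a node read: it returns a metadata dict or None.
    """
    missing = [i for i in identifiers if lookup(i) is None]
    if missing:
        # Assumption: one missing identifier fails the whole request.
        raise LookupError(f"ERR_CONTENT_NOT_FOUND: {', '.join(missing)}")

    events = []
    for identifier in identifiers:
        node = lookup(identifier)
        events.append({
            "eid": "BE_JOB_REQUEST",
            "object": {"id": identifier},  # illustrative field layout
            "edata": {
                "action": "enrich",
                "objectType": object_type_for(node["mimeType"]),
            },
        })
    return events  # the caller would publish these to the publish topic

store = {
    "do_1": {"mimeType": COLLECTION_MIME},
    "do_2": {"mimeType": "video/mp4"},
}
events = build_enrich_events(["do_1", "do_2"], store.get)
```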

Test plan

search_mode=semantic with filters — verify filters applied correctly, cosine scores present, kNN ordering preserved
min_score — verify hits below threshold excluded
POST /content/v3/enrich — verify Kafka event emitted, EnrichOnlyFunction picks it up
Existing search_mode=text requests — verify no regression

…ding, int8 quantization

Add semantic search support behind a new search_mode dispatch on /v3/search. Defaults to text mode so existing clients are unaffected.

- QueryStrategy interface + TextQueryStrategy (wraps existing logic) + SemanticQueryStrategy (nested kNN on chunks.embedding).
- Pluggable EmbeddingClient (OpenAI/Azure, E5) + factory mirrors the content-embedding-job's service layer so query-time and index-time vectors share a contract.
- Int8 QuantizationStrategy ports the job's L2-norm-detection branch.
- SHA-256 keyed LRU+TTL EmbeddingCache.
- application.conf semantic_search block, disabled by default.
- Docs: DESIGN, API_SPEC, IMPLEMENTATION_PLAN, FLOWCHART under docs/semantic-search/.
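As an illustration of the SHA-256 keyed LRU+TTL cache mentioned in this commit, here is a minimal sketch. It is not the PR's EmbeddingCache. The constructor parameters, the inclusion of the model name in the key, and the injectable clock are assumptions.

```python
import hashlib
import time
from collections import OrderedDict

class EmbeddingCache:
    """Illustrative LRU cache with per-entry expiry, keyed by a SHA-256 digest."""

    def __init__(self, max_entries=1000, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (expires_at, vector)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Assumption: the key covers both model and text, so a model change
        # never serves stale vectors.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, model, text):
        key = self._key(model, text)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, vector = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return vector

    def put(self, model, text, vector):
        key = self._key(model, text)
        self._entries[key] = (self._clock() + self._ttl, vector)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```

Hashing the query text keeps keys at a fixed size regardless of query length, and the TTL bounds how long a vector can outlive a change to the embedding backend.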

- HybridQueryStrategy.build now falls back to text instead of throwing, so processCount, getCollectionsResult, and external processSearchQuery callers don't crash when given a hybrid-mode DTO.
- SemanticQueryStrategy: force fuzzy=false while inheriting filters. prepareFilteredSearchQuery returns a FunctionScoreQueryBuilder, not a BoolQueryBuilder, which silently dropped every filter and left kNN running unconstrained against the whole index.
- HybridSearchExecutor: instantiate a fresh SearchProcessor per leg. The shared processor.relevanceSort flag races between parallel text and semantic sub-searches and can flip sort order non-deterministically.
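The per-leg isolation in the last bullet can be sketched as follows. The `SearchProcessor` class here is a hypothetical stand-in with one mutable flag, not the platform's class. The point is only that each parallel leg constructs its own instance, so neither can observe the other's sort state.

```python
from concurrent.futures import ThreadPoolExecutor

class SearchProcessor:
    """Hypothetical stand-in that carries mutable per-search state."""

    def __init__(self):
        self.relevance_sort = False

    def search(self, mode, query):
        # The flag is set as a side effect of running a search. With one
        # shared instance, the other leg could overwrite it mid-flight.
        self.relevance_sort = (mode == "semantic")
        return {"mode": mode, "query": query, "relevance_sort": self.relevance_sort}

def run_hybrid_legs(query):
    """Run both legs in parallel, each on a fresh processor instance."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text = pool.submit(lambda: SearchProcessor().search("text", query))
        semantic = pool.submit(lambda: SearchProcessor().search("semantic", query))
        return text.result(), semantic.result()

text_result, semantic_result = run_hybrid_legs("photosynthesis")
```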

…Strategy

getAllFieldsPropertyQuery wraps the multi_match in a BoolQueryBuilder, so the previous getName()-based isFullTextLeg detection missed it. The wrapped should-only bool then slipped into the outer filter clause where its implicit minimum_should_match=1 turned every query term into a hard filter requirement and excluded all matches. Now we re-run the text query builder against a DTO whose properties have had the propertyName='*' entry stripped, so the inherited bool contains only the property filters. Implicit-filter shoulds are also kept as filters to preserve their constraint behavior.
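The stripping step can be illustrated with a small sketch, assuming properties are represented as dictionaries with a `propertyName` key. The function name and the dictionary shape are hypothetical; only the propertyName='*' convention comes from the commit message.

```python
def strip_full_text_entry(properties):
    """Return only the property filters, dropping the full-text '*' entry."""
    return [p for p in properties if p.get("propertyName") != "*"]

properties = [
    {"propertyName": "*", "values": ["water cycle"]},   # full-text query leg
    {"propertyName": "status", "values": ["Live"]},     # ordinary filter
    {"propertyName": "medium", "values": ["English"]},  # ordinary filter
]
filters_only = strip_full_text_entry(properties)
```

Rebuilding the filter query from `filters_only` keeps the free-text terms out of the kNN filter clause, so they can no longer act as hard requirements.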

- Skip the default name/lastUpdatedOn sort when search_mode=semantic so the kNN relevance order is preserved.
- Return results via the with-score path for semantic mode so callers see the cosine score on each hit. Text mode still returns scores only when fuzzy=true (unchanged contract).

- Add POST /content/v3/enrich endpoint in ContentController
- ContentActor routes triggerEnrich to EnrichManager
- EnrichManager validates IDs via DataNode.read, derives objectType from mimeType
- Emits BE_JOB_REQUEST events with action=enrich to publish.job.request topic
- Returns ERR_CONTENT_NOT_FOUND for non-existent identifiers
- Add TRIGGER_ENRICH api ID and update routes in both knowlg-service and content-service

coderabbitai · 2026-05-27T00:08:38Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

sonarqubecloud · 2026-05-27T00:18:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-27T00:18:09Z

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

chethann007 added 9 commits May 24, 2026 00:40

feat: hybrid search via RRF over parallel text + semantic queries

0f7066f

docs: add OpenAPI 3.0 spec for /v3/search with semantic and hybrid modes

cca48b0

fix: add missing space in openapi.yaml description key

927c672

feat: apply minScore to search queries when semantic mode is enabled

c2a080c

chethann007 wants to merge 9 commits into develop from feat/semantic-search
