Improve component search scoring relevance#2426

Open
Mbeaulne wants to merge 1 commit into 06-18-add_synonym_groups from 06-18-improve_component_search_scoring_relevance

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Improves the lexical search scoring model with three enhancements:

  • Prefix match boost: For a partial query term (e.g. classif), components where the term is a prefix of a token now rank higher than components where it appears only as a mid-string substring.
  • IDF-style rare token weighting: Query tokens that match fewer components are weighted more heavily than common tokens, preventing high-frequency terms from dominating scores. For example, searching train xgboost will surface components mentioning xgboost above generic train matches.
  • All-query-tokens bonus: When a component matches every token in the query (across any fields), it receives an additional score bonus, ensuring more complete matches rank above partial ones.
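
The prefix boost and the all-query-tokens bonus can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the constants `PREFIX_WEIGHT`, `SUBSTRING_WEIGHT`, and `ALL_TOKENS_BONUS` and the functions `termMatchScore` and `scoreDocument` are hypothetical names with made-up values, and IDF weighting is left out to keep the sketch small.

```typescript
// Hypothetical weights; the constants actually used in the PR are not shown in this description.
const PREFIX_WEIGHT = 2;
const SUBSTRING_WEIGHT = 1;
const ALL_TOKENS_BONUS = 3;

// Score one query term against one document token:
// a prefix match outranks a mid-string substring match.
function termMatchScore(term: string, docToken: string): number {
  if (docToken.startsWith(term)) return PREFIX_WEIGHT;
  if (docToken.includes(term)) return SUBSTRING_WEIGHT;
  return 0;
}

// Sum the best match per query term, then add a bonus
// when every query term matched at least one document token.
function scoreDocument(queryTerms: string[], docTokens: string[]): number {
  if (queryTerms.length === 0) return 0;
  let total = 0;
  let matched = 0;
  for (const term of queryTerms) {
    const best = Math.max(0, ...docTokens.map((t) => termMatchScore(term, t)));
    if (best > 0) matched += 1;
    total += best;
  }
  if (matched === queryTerms.length) total += ALL_TOKENS_BONUS;
  return total;
}
```

Under these assumed weights, `scoreDocument(["classif"], ["classify", "rows"])` scores higher than `scoreDocument(["classif"], ["reclassify"])`, and a document matching both `train` and `model` scores higher than one matching only `train`.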

The phrase match bonus previously applied only to the name field has been extended to all search fields using per-field bonus weights (FIELD_PHRASE_BONUS).
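
A per-field phrase bonus might look something like the sketch below. Only the name `FIELD_PHRASE_BONUS` comes from this PR; the field set, the weight values, the helpers `containsPhrase` and `phraseBonus`, and the choice to sum bonuses across fields are assumptions made for illustration.

```typescript
// Hypothetical field set; the real search fields are not listed in this description.
type SearchField = "name" | "description" | "tags";

// Illustrative values only; the actual FIELD_PHRASE_BONUS weights are not given here.
const FIELD_PHRASE_BONUS: Record<SearchField, number> = {
  name: 5,
  description: 2,
  tags: 1,
};

// True when `phrase` occurs as a contiguous run of tokens inside `tokens`.
function containsPhrase(tokens: string[], phrase: string[]): boolean {
  if (phrase.length === 0 || phrase.length > tokens.length) return false;
  for (let i = 0; i + phrase.length <= tokens.length; i++) {
    if (phrase.every((p, j) => tokens[i + j] === p)) return true;
  }
  return false;
}

// Add the bonus of every field that contains the whole phrase (summing is an assumption).
function phraseBonus(
  fields: Record<SearchField, string[]>,
  phrase: string[],
): number {
  let bonus = 0;
  for (const field of Object.keys(FIELD_PHRASE_BONUS) as SearchField[]) {
    if (containsPhrase(fields[field], phrase)) bonus += FIELD_PHRASE_BONUS[field];
  }
  return bonus;
}
```

Keeping the weights in a single record makes it easy to tune how much a phrase hit in the name should outweigh one in a longer field such as the description.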

The tokenize function has been refactored to extract a reusable uniqueTokens helper. A new requiredQueryTokens function produces stemmed, deduplicated tokens from the raw query without synonym expansion; these tokens are used for the phrase and completeness checks.
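
As a rough sketch of how the two helpers could fit together: `uniqueTokens` and `requiredQueryTokens` are the names used in this PR, but the bodies below are guesses. In particular, `stem` is a toy suffix stripper and `splitWords` is a hypothetical splitter, both standing in for whatever the real index uses.

```typescript
// Deduplicate tokens while preserving first-seen order.
function uniqueTokens(tokens: string[]): string[] {
  return Array.from(new Set(tokens));
}

// Toy stemmer for illustration only: strips a few common English suffixes.
function stem(token: string): string {
  for (const suffix of ["ing", "ers", "er", "s"]) {
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length);
    }
  }
  return token;
}

// Hypothetical splitter: lowercase, then split on runs of non-alphanumeric characters.
function splitWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Stemmed, deduplicated query tokens with no synonym expansion.
function requiredQueryTokens(query: string): string[] {
  return uniqueTokens(splitWords(query).map(stem));
}
```

Skipping synonym expansion here means the completeness check asks whether the component covers what the user actually typed, rather than being satisfied by any synonym of a single term.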

Related Issue and Pull requests

Type of Change

  • Bug fix
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Test Instructions

Three new unit tests cover the added behaviors:

  1. Search classif — verify classify_rows ranks above a component with classif as a non-prefix substring.
  2. Search train xgboost — verify the component with the rare token xgboost ranks first.
  3. Search train model — verify the component matching both tokens across fields ranks above one matching only train.

Run the test suite with:

npx jest componentSearchIndex

Additional Comments

Token weights are computed per-query using a smoothed inverse document frequency: 1 + log((N+1) / (df+1)), where N is the index size and df is the number of entries containing the token.
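
The weight formula above translates into a few lines of TypeScript. The function names and the array-of-token-arrays index shape are hypothetical, and the natural logarithm is assumed because the description only says "log".

```typescript
// Count how many index entries contain the token (df in the formula).
// The index shape (one token array per entry) is an assumption for this sketch.
function documentFrequency(index: string[][], token: string): number {
  return index.filter((entry) => entry.includes(token)).length;
}

// Smoothed inverse document frequency: 1 + log((N + 1) / (df + 1)).
// Natural log is assumed here.
function tokenWeight(indexSize: number, df: number): number {
  return 1 + Math.log((indexSize + 1) / (df + 1));
}

// Tiny example index: "train" is common, "xgboost" is rare.
const exampleIndex: string[][] = [
  ["train", "model"],
  ["train", "xgboost"],
  ["train", "sklearn"],
  ["load", "data"],
];

const trainWeight = tokenWeight(
  exampleIndex.length,
  documentFrequency(exampleIndex, "train"),
);
const xgboostWeight = tokenWeight(
  exampleIndex.length,
  documentFrequency(exampleIndex, "xgboost"),
);
```

In this example `xgboostWeight` is larger than `trainWeight`, which is the behavior the train xgboost test relies on. The `+1` smoothing keeps the weight finite when a token matches no entries, and a token present in every entry gets a weight of exactly 1.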

github-actions Bot commented Jun 18, 2026

🎩 Preview

A preview build has been created at: 06-18-improve_component_search_scoring_relevance/d4d0a60

Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts
Comment thread src/services/componentSearchIndex.test.ts Outdated
Comment thread src/services/componentSearchIndex.test.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 2655160 to dce82a1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from bbd53a7 to 36032c1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from dce82a1 to f5a29c0 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from 36032c1 to d8e31f8 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-improve_component_search_scoring_relevance branch from d8e31f8 to d4d0a60 Compare June 18, 2026 20:49