Improve component search scoring relevance#2426
Open
Mbeaulne wants to merge 1 commit into
This was referenced Jun 18, 2026
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
Mbeaulne
commented
Jun 18, 2026

Description
Improves the lexical search scoring model with three enhancements:
- Partial queries (e.g. `classif`) now rank components where the term is a prefix of a token higher than components where it appears only as a mid-string substring.
- Rare query tokens carry more weight: `train xgboost` will surface components mentioning `xgboost` above generic `train` matches.
- The phrase match bonus, previously applied only to the `name` field, has been extended to all search fields using per-field bonus weights (`FIELD_PHRASE_BONUS`).

The `tokenize` function has been refactored to extract a reusable `uniqueTokens` helper. A new `requiredQueryTokens` function produces stemmed, deduplicated tokens from the raw query without synonym expansion; these tokens are used for phrase and completeness checks.

Related Issue and Pull requests
Type of Change
Checklist
Test Instructions
Three new unit tests cover the added behaviors:
- `classif`: verify `classify_rows` ranks above a component with `classif` as a non-prefix substring.
- `train xgboost`: verify the component with the rare token `xgboost` ranks first.
- `train model`: verify the component matching both tokens across fields ranks above one matching only `train`.

Run the test suite with:
Additional Comments
Token weights are computed per-query using a smoothed inverse document frequency:
`1 + log((N+1) / (df+1))`, where `N` is the index size and `df` is the number of entries containing the token.
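
To make the weighting formula concrete, here is a minimal sketch of how it could favor a rare token such as `xgboost` over a common one such as `train`. It is not the code from this PR: the `IndexEntry` type, the `tokenWeight` and `scoreEntry` helpers, and the sample index are hypothetical, and the natural logarithm is assumed because the description does not state a base.

```typescript
// Hypothetical index entry: each component is reduced to a flat token list.
type IndexEntry = { id: string; tokens: string[] };

// Smoothed IDF as stated in the description: 1 + log((N + 1) / (df + 1)).
// Assumption: natural log (Math.log); the PR text does not specify the base.
function tokenWeight(token: string, entries: IndexEntry[]): number {
  const n = entries.length;
  const df = entries.filter((e) => e.tokens.includes(token)).length;
  return 1 + Math.log((n + 1) / (df + 1));
}

// Illustrative scoring only: sum the weights of the query tokens an entry contains.
function scoreEntry(
  entry: IndexEntry,
  queryTokens: string[],
  entries: IndexEntry[],
): number {
  let total = 0;
  for (const token of queryTokens) {
    if (entry.tokens.includes(token)) {
      total += tokenWeight(token, entries);
    }
  }
  return total;
}

// Made-up sample data: "train" appears in two entries, "xgboost" in one.
const sampleIndex: IndexEntry[] = [
  { id: "train_model", tokens: ["train", "model"] },
  { id: "train_sklearn", tokens: ["train", "sklearn"] },
  { id: "xgboost_predict", tokens: ["xgboost", "predict"] },
];

const queryTokens = ["train", "xgboost"];
const ranked = [...sampleIndex].sort(
  (a, b) =>
    scoreEntry(b, queryTokens, sampleIndex) -
    scoreEntry(a, queryTokens, sampleIndex),
);

// The entry containing the rarer token sorts first.
console.log(ranked.map((e) => e.id));
```

Because the weight depends on `N` and `df` for the current index, a token that appears in every entry still receives a weight of exactly 1 rather than 0, so common terms continue to contribute to the score.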