Add source-diverse AI rerank candidate pool #2429
appendUniqueMatches(
  candidates,
  seenDigests,
  sampleEvenly(sortedIndex, AI_CANDIDATE_LIMIT).map(indexEntryToLexicalMatch),
🤖 This is an AI-generated code review comment.
[MEDIUM] The source-diversity layer (buildSourceDiverseBrowseMatches) and this alphabetical-fill layer run unconditionally, padding the AI candidate pool toward the cap (AI_CANDIDATE_LIMIT = 80) with alphabetically-early, lexically-irrelevant components even when lexical search already returned a strong source-spanning set.
This is not a correctness bug — lexical hits are preserved first (appended before the fill layers), and RERANK_EXCLUSION_THRESHOLD keeps junk from being badged. But it sends more low-signal candidates to the billed reranker on every rerank (cost/latency), and can surface irrelevant items in the unbadged tail.
Optional fix: only run the fill layer when the pool is under a smaller floor, or skip the alphabetical fill when lexical + diversity already produced a source-diverse set. Worth confirming reranker cost at 80 vs 50 candidates is acceptable.
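One way the optional fix could look, as a minimal sketch rather than the PR's actual code: the fill layer runs only when the pool is below a floor or does not yet span several sources. The names `Match`, `appendUnique`, `buildCandidatePool`, `FILL_FLOOR`, and `MIN_SOURCES` are hypothetical and chosen for illustration; only `AI_CANDIDATE_LIMIT = 80` comes from the comment above.

```typescript
// Hypothetical shape of a candidate; the real match type is not shown in this PR page.
interface Match {
  digest: string;
  source: string;
}

const AI_CANDIDATE_LIMIT = 80; // cap quoted in the review comment
const FILL_FLOOR = 30;         // assumed floor, not a value from the PR
const MIN_SOURCES = 2;         // assumed threshold for "source-diverse"

// Appends matches whose digest has not been seen, stopping at the limit.
function appendUnique(
  candidates: Match[],
  seen: Set<string>,
  extra: readonly Match[],
  limit: number,
): void {
  for (const match of extra) {
    if (candidates.length >= limit) return;
    if (seen.has(match.digest)) continue;
    seen.add(match.digest);
    candidates.push(match);
  }
}

// Lexical hits always go in first; the fill layer is skipped when the
// pool is already large enough and already spans enough sources.
function buildCandidatePool(
  lexical: readonly Match[],
  fill: readonly Match[],
): Match[] {
  const candidates: Match[] = [];
  const seen = new Set<string>();
  appendUnique(candidates, seen, lexical, AI_CANDIDATE_LIMIT);

  const sources = new Set(candidates.map((m) => m.source));
  const needsFill =
    candidates.length < FILL_FLOOR || sources.size < MIN_SOURCES;
  if (needsFill) {
    appendUnique(candidates, seen, fill, AI_CANDIDATE_LIMIT);
  }
  return candidates;
}
```

With this gating, a query that already returns a strong, source-spanning lexical set sends only those hits to the reranker, while sparse or single-source results are still padded up to the cap.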

Description
The AI rerank candidate pool now uses a source-diverse selection strategy instead of relying purely on lexical hits. Previously, the candidate pool was capped at 50 results drawn entirely from lexical matches, falling back to a plain alphabetical slice when no matches were found. This meant components from underrepresented sources (e.g. user uploads) could be crowded out entirely when a query produced many strong lexical hits from a single source.
The new approach builds the candidate pool in three layers:
The rerank base is now always set to aiCandidateMatches rather than switching between lexical and AI candidate lists depending on whether lexical results existed.
Related Issue and Pull requests
Type of Change
Checklist
Screenshots (if applicable)
Test Instructions
Run componentSearchV2Logic.test.ts to confirm the new source-diversity test passes.
Additional Comments
The new sampleEvenly helper picks items at uniform intervals so that the browse sample is representative of the full sorted list rather than just the top entries.
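A minimal sketch of how a helper like this might work, assuming it takes a sorted array and a limit as the call site in the review suggests. The body below is an illustration, not the implementation from this PR.

```typescript
// Illustrative only: picks up to `limit` items spaced evenly across `items`.
function sampleEvenly<T>(items: readonly T[], limit: number): T[] {
  if (limit <= 0) return [];
  // Nothing to thin out when the list already fits within the limit.
  if (items.length <= limit) return [...items];

  // step > 1 here, so each floored index is distinct and increasing.
  const step = items.length / limit;
  const sample: T[] = [];
  for (let i = 0; i < limit; i++) {
    sample.push(items[Math.floor(i * step)]);
  }
  return sample;
}

// Ten sorted entries sampled down to five: every second entry is kept.
const picked = sampleEvenly([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 5);
console.log(picked); // [0, 2, 4, 6, 8]
```

Compared with a plain `slice(0, limit)`, this keeps entries from the whole alphabetical range instead of only the earliest names.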