fix(retriever): use file content as rerank fallback when file abstract is empty #2791
Open
bhnan wants to merge 1 commit into
Description
When a file candidate (level-2 node) has an empty abstract, the rerank step previously passed an empty string to the rerank API, which could cause the entire batch to fail or fall back to vector scores for all candidates.
This PR implements two fixes:
File content fallback: When a level-2 file node has an empty abstract, read the original file content (up to 4000 chars) as a fallback rerank document. This only applies to text-compatible file extensions (.txt, .md, .py, .js, .json, etc.).
Empty document filtering: Even with the file-content fallback, some documents can still be empty (non-level-2 nodes, unsupported suffixes, read failures). Filter out empty documents before calling rerank_batch() so one empty entry does not drag down the entire batch. Empty slots retain their original vector (fallback) scores.
Related Issue
Fixes #2330
Type of Change
Changes Made
Added FILE_RERANK_FALLBACK_MAX_CHARS = 4000 class constant
Added _build_rerank_document() and _build_rerank_documents() methods to build rerank docs with file content fallback
Replaced str(r.get("abstract", "")) calls with async _build_rerank_documents() calls
Updated _rerank_scores() to filter empty documents before rerank_batch() and map scores back to original positions
Testing
Added test_retrieve_uses_file_content_fallback_for_empty_level_two_abstract
Added test_rerank_scores_filters_empty_documents_before_rerank
All tests in test_hierarchical_retriever_rerank.py pass
.venv/bin/pytest tests/retrieve/test_hierarchical_retriever_rerank.py -v
# 12 passed, 4 warnings in 3.84s
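For reviewers who want a quick mental model, the two behaviours described in this PR (file-content fallback and empty-document filtering) can be sketched roughly as below. This is a minimal, synchronous illustration and not the PR's code: the real methods are async and named _build_rerank_document() and _rerank_scores(). The candidate keys "level" and "path", the abbreviated TEXT_SUFFIXES set, and the rerank_batch(query, documents) callable signature are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# The 4000-char cap comes from the PR description. The suffix list there ends
# in "etc.", so this set is only an illustrative subset (assumption).
FILE_RERANK_FALLBACK_MAX_CHARS = 4000
TEXT_SUFFIXES = {".txt", ".md", ".py", ".js", ".json"}


def build_rerank_document(candidate: Dict) -> str:
    """Return the abstract, or leading file text for a level-2 node without one."""
    abstract = str(candidate.get("abstract", "") or "").strip()
    if abstract:
        return abstract
    # "level" and "path" are hypothetical keys for the candidate record.
    path = candidate.get("path")
    if candidate.get("level") != 2 or not path:
        return ""
    file_path = Path(path)
    if file_path.suffix.lower() not in TEXT_SUFFIXES:
        return ""
    try:
        with file_path.open("r", encoding="utf-8", errors="ignore") as fh:
            # In text mode, read(n) returns at most n characters.
            return fh.read(FILE_RERANK_FALLBACK_MAX_CHARS).strip()
    except OSError:
        return ""  # read failure: the slot stays empty and keeps its vector score


def rerank_scores(
    query: str,
    documents: Sequence[str],
    fallback_scores: Sequence[float],
    rerank_batch: Callable[[str, List[str]], List[float]],
) -> List[float]:
    """Rerank only non-empty documents; empty slots keep their fallback scores."""
    scores = list(fallback_scores)
    kept = [i for i, doc in enumerate(documents) if doc.strip()]
    if not kept:
        return scores
    reranked = rerank_batch(query, [documents[i] for i in kept])
    if len(reranked) != len(kept):
        return scores  # unexpected response size: fall back for every slot
    # Map each rerank score back to the candidate's original position.
    for i, score in zip(kept, reranked):
        scores[i] = score
    return scores
```

The index list `kept` is what lets scores be written back to their original positions, so a candidate whose document is still empty after the fallback never reaches the rerank call and simply keeps its vector score.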