fix(retriever): use file content as rerank fallback when file abstract is empty#2791

Open
bhnan wants to merge 1 commit into volcengine:main from bhnan:fix/rerank-empty-abstract

bhnan commented Jun 23, 2026

Description

When a file candidate (level-2 node) has an empty abstract, the rerank step previously passed an empty string to the rerank API. That could make the entire batch fail, or make every candidate fall back to its vector score.

This PR implements two fixes:

  1. File content fallback: When a level-2 file node has an empty abstract, read the original file content (up to 4000 chars) and use it as the rerank document. This applies only to text-compatible file extensions (.txt, .md, .py, .js, .json, etc.).

  2. Empty document filtering: Even with the file-content fallback, some documents can still be empty (non-level-2 nodes, unsupported suffixes, read failures). Filter out empty documents before calling rerank_batch() so one empty entry does not drag down the entire batch. Empty slots retain their original vector (fallback) scores.
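
The first fix can be sketched roughly as below. This is a simplified, synchronous illustration and not the code in this PR: the actual methods are async, and the candidate field names (`level`, `path`), the `TEXT_SUFFIXES` set, the local-filesystem read, and the function name `build_rerank_document` are all assumptions made for the example. Only the 4000-character cap comes from the description above.

```python
from pathlib import Path

# The 4000-char cap is named in the PR; the suffix set is illustrative only.
FILE_RERANK_FALLBACK_MAX_CHARS = 4000
TEXT_SUFFIXES = {".txt", ".md", ".py", ".js", ".json"}


def build_rerank_document(candidate: dict) -> str:
    """Return the abstract, or truncated file content for level-2 nodes.

    Hypothetical helper: the "level" and "path" keys are assumed names.
    """
    abstract = str(candidate.get("abstract") or "").strip()
    if abstract:
        return abstract
    # Only file nodes (level 2) are eligible for the content fallback.
    if candidate.get("level") != 2:
        return ""
    path = Path(str(candidate.get("path", "")))
    if path.suffix.lower() not in TEXT_SUFFIXES:
        return ""
    try:
        # In text mode, read(n) returns at most n characters.
        with path.open("r", encoding="utf-8", errors="replace") as fh:
            return fh.read(FILE_RERANK_FALLBACK_MAX_CHARS).strip()
    except OSError:
        # A read failure leaves the document empty; the caller filters it out.
        return ""
```

Returning an empty string for every unsupported case keeps the helper total, so the second fix (filtering) is the single place that decides what happens to empty documents.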

Related Issue

Fixes #2330

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Changes Made

  • Added FILE_RERANK_FALLBACK_MAX_CHARS = 4000 class constant
  • Added _build_rerank_document() and _build_rerank_documents() methods to build rerank docs with file content fallback
  • Replaced all 3 direct str(r.get("abstract", "")) calls with async _build_rerank_documents() calls
  • Modified _rerank_scores() to filter empty documents before rerank_batch() and map scores back to original positions
  • Added 2 new test cases covering both fixes
  • All 12 tests pass
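
For reviewers, the filter-and-map-back step in `_rerank_scores()` might look something like the sketch below. It is a standalone approximation under stated assumptions, not the method from this PR: the function name `rerank_scores`, the `rerank_batch(query, documents)` callable signature, and the fall-back-on-exception or length-mismatch behaviour are guesses made for illustration.

```python
from typing import Callable, List, Sequence


def rerank_scores(
    query: str,
    documents: Sequence[str],
    fallback_scores: Sequence[float],
    rerank_batch: Callable[[str, List[str]], List[float]],  # assumed signature
) -> List[float]:
    """Rerank only non-empty documents; empty slots keep their vector score."""
    scores = list(fallback_scores)
    # Remember the original positions of the documents worth sending.
    kept = [i for i, doc in enumerate(documents) if doc.strip()]
    if not kept:
        return scores
    try:
        reranked = rerank_batch(query, [documents[i] for i in kept])
    except Exception:
        # Assumed behaviour: any rerank failure falls back to vector scores.
        return scores
    if not reranked or len(reranked) != len(kept):
        return scores
    # Write each rerank score back to the slot its document came from.
    for i, score in zip(kept, reranked):
        scores[i] = float(score)
    return scores
```

With documents `["a", "", "b"]` and a reranker that returns two scores, only positions 0 and 2 are overwritten; position 1 keeps its vector score.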

Testing

  • Added test_retrieve_uses_file_content_fallback_for_empty_level_two_abstract
  • Added test_rerank_scores_filters_empty_documents_before_rerank
  • All 12 tests in test_hierarchical_retriever_rerank.py pass
.venv/bin/pytest tests/retrieve/test_hierarchical_retriever_rerank.py -v
# 12 passed, 4 warnings in 3.84s

Development

Successfully merging this pull request may close these issues.

OpenAI-compatible rerank falls back for whole batch when candidates contain empty abstracts