Observation
trustgraph/retrieval/document_rag/document_rag.py:78-136 does:
results = await asyncio.gather(*[query_concept(v) for v in vectors])
# dedupe by chunk_id, fetch from Garage
# ... pass deduped chunks straight to document_prompt synthesis
There is no reranker stage. Qdrant returns raw cosine top-k, deduplicated, passed straight to the synthesis LLM. No MMR, no diversity penalty, no token-budget cap, no cross-encoder rerank.
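To make the missing token-budget cap concrete, here is a minimal sketch. The function name `cap_to_token_budget` and the whitespace-based token counter are illustrative assumptions, not part of TrustGraph; a real deployment would count tokens with the synthesis model's tokenizer.

```python
from typing import Callable, Iterable


def cap_to_token_budget(
    chunks: Iterable[str],
    max_tokens: int,
    # Assumption: whitespace word count as a crude stand-in for a real tokenizer.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep chunks in their ranked order until the next one would exceed max_tokens."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept
```

A cap like this only helps if the chunks arrive in a meaningful order, which is why it pairs naturally with a rerank step.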
Why this matters
Cosine top-k is approximate-and-topical, not answer-aware. For executive-synthesis questions ("Who are X's main competitors?"), the top-3 cosine matches may all be the same paragraph rephrased, or all from the same source document, when the answer needs diversity across sources to be trustworthy.
Issue #878 (open, 2026-05-07) raises the cross-encoder reranking concern. This issue is the same concern, scoped specifically to the document-rag synthesis path (vs the general retrieval surface).
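The diversity concern can be illustrated with maximal marginal relevance (MMR), one of the techniques noted as absent. The sketch below is a plain-Python illustration, not TrustGraph code; `mmr_select`, its parameters, and the toy vectors are assumptions made for the example.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def mmr_select(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int,
    lambda_: float = 0.5,
) -> list[int]:
    """Return indices of up to k chunks, trading relevance against redundancy.

    lambda_ = 1.0 reduces to plain cosine top-k; lower values penalise
    chunks that are similar to ones already selected.
    """
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    remaining = list(range(len(chunk_vecs)))
    selected: list[int] = []

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy = similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: chunks 0 and 1 are duplicates; chunk 2 covers a different direction.
chunks = [[1.0, 0.1, 0.0], [1.0, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 1.0, 0.0]
diverse = mmr_select(query, chunks, k=2)              # skips the duplicate
plain = mmr_select(query, chunks, k=2, lambda_=1.0)   # behaves like cosine top-k
```

On this toy input, plain top-k returns both copies of the duplicated chunk, while MMR with `lambda_=0.5` replaces the second copy with the less similar chunk.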
Measured impact
In our Sizzl deployment, raising --doc-limit from 3 to 30 moved the rubric score by only +0.46 points, which suggests the additional chunks at the tail of the top-30 were not materially improving synthesis. A reranker that surfaces 10 diverse, high-relevance chunks out of 30 retrieved would likely beat 30 unreranked chunks on both quality and latency (fewer tokens to synthesize).
Proposal
Add a reranker stage between get_docs() and document_prompt() in document_rag.py. Pluggable design:
- Cohere Rerank (API)
- BGE-reranker (local)
- Cross-encoder (local, e.g. ms-marco-MiniLM)
Insertion point: inside get_docs() after the Qdrant gather and before fetch_chunk() — rerank the chunk_id list, fetch fewer chunks, send leaner context to synthesis.
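One way the pluggable design could look is sketched below. Every name here (`Candidate`, `Reranker`, `TermOverlapReranker`, `select_chunk_ids`) is hypothetical and not taken from TrustGraph. The sketch also assumes that some text for each candidate is available at the rerank step; if only chunk_ids are known before fetch_chunk(), the text would have to come from elsewhere (for example a stored payload), or the rerank would have to run after the fetch.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record: an id plus the text used for scoring."""
    chunk_id: str
    text: str


class Reranker(Protocol):
    """Hypothetical interface that each backend (API or local model) would implement."""

    def rerank(
        self, query: str, candidates: Sequence[Candidate], top_n: int
    ) -> list[Candidate]: ...


class TermOverlapReranker:
    """Placeholder backend: scores by shared query terms.

    This is a stand-in so the sketch runs without external dependencies;
    it is not a substitute for a cross-encoder.
    """

    def rerank(
        self, query: str, candidates: Sequence[Candidate], top_n: int
    ) -> list[Candidate]:
        terms = set(query.lower().split())

        def score(c: Candidate) -> int:
            return len(terms & set(c.text.lower().split()))

        # sorted() is stable, so ties keep their original retrieval order.
        return sorted(candidates, key=score, reverse=True)[:top_n]


def select_chunk_ids(
    query: str,
    candidates: Sequence[Candidate],
    reranker: Reranker,
    top_n: int = 10,
) -> list[str]:
    """Rerank candidates and return the chunk_ids worth fetching."""
    return [c.chunk_id for c in reranker.rerank(query, candidates, top_n)]


candidates = [
    Candidate("a", "pricing page overview"),
    Candidate("b", "main competitors of acme include beta"),
    Candidate("c", "acme competitors list"),
]
picked = select_chunk_ids("acme main competitors", candidates, TermOverlapReranker(), top_n=2)
```

Keeping the interface to a single `rerank` method means a Cohere, BGE, or cross-encoder backend could be swapped in by configuration without touching the retrieval code.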
Estimated latency cost: <500ms for local cross-encoder, ~100-300ms for Cohere API.
Estimated quality lift: 5-15% on synthesis rubric metrics (anecdotal, varies by corpus).
Related
Stack
TrustGraph 2.3.21.