fix for indexing issue with knowledge bases #274
Conversation
📝 Walkthrough

Adds an Alembic migration creating/dropping three HNSW vector indexes and one GIN index; refactors vector search into HNSW candidate CTEs with explicit vector casts and limited candidate sets; tightens token updates to only NULL tokens; updates image retrieval to require both CLIP and DINO embeddings and to run retrieval sequentially.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
...db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py (fixed)
```python
# revision identifiers, used by Alembic.
revision: str = 'e8f2a1c3b5d9'
down_revision: Union[str, None] = 'c7a9e2f4b1d0'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
```
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py`:
- Around line 21-69: The migration uses CREATE/DROP INDEX CONCURRENTLY inside
upgrade() and downgrade() via op.execute, which fails because CONCURRENTLY
cannot run inside a transaction; wrap each op.execute that creates or drops an
index in both upgrade() and downgrade() with an autocommit block using
op.get_context().autocommit_block() so each CREATE INDEX CONCURRENTLY / DROP
INDEX CONCURRENTLY runs outside the surrounding transaction (i.e., replace
direct op.execute(...) calls for ix_kbe_embedding_vector_hnsw_cosine,
ix_kbe_embedding_vector_hnsw_l2, ix_kbe_embedding_vector_1_hnsw_cosine, and
ix_kbe_token_gin with the same SQL executed inside with
op.get_context().autocommit_block():).
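The fix described above can be sketched as follows. This is an illustrative example, not the PR's actual migration: only one of the four indexes is spelled out, and the table and column definitions are assumed.

```python
# Hedged sketch of the suggested fix, not the PR's actual migration.
# Table/column names are assumed; only one of the four indexes is shown.
CREATE_COSINE_IDX = (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
    "ix_kbe_embedding_vector_hnsw_cosine "
    "ON knowledge_base_embeddings "
    "USING hnsw ((embedding_vector::vector(512)) vector_cosine_ops)"
)
DROP_COSINE_IDX = (
    "DROP INDEX CONCURRENTLY IF EXISTS ix_kbe_embedding_vector_hnsw_cosine"
)


def upgrade() -> None:
    # Imported lazily so this sketch is importable outside a migration run.
    from alembic import op

    # CONCURRENTLY refuses to run inside a transaction block, so leave
    # Alembic's migration transaction before executing the statement.
    with op.get_context().autocommit_block():
        op.execute(CREATE_COSINE_IDX)


def downgrade() -> None:
    from alembic import op

    with op.get_context().autocommit_block():
        op.execute(DROP_COSINE_IDX)
```

The same pattern applies to each of the other three index statements: every `op.execute` that creates or drops an index gets its own statement inside the autocommit block.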
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`:
- Around line 92-123: The ANN candidate CTE hnsw_candidates is built across the
entire KnowledgeBaseEmbeddings table before applying the KB and metadata filters
in vector_results, so its LIMIT (:limit * 10) can drop all matches for the
requested KB; modify the query so the KB and metadata filters are applied inside
hnsw_candidates (e.g., join hnsw_candidates to KnowledgeBaseDocuments or
restrict by document_id IN (SELECT id FROM KnowledgeBaseDocuments WHERE
knowledge_base_id = :kb_id AND <metadata_filter_clause_inner>)) so the pre-LIMIT
candidate set is already scoped to the requested KB and
metadata_filter_clause_inner, then keep the subsequent vector_results CTE and
its final LIMIT :limit unchanged.
- Around line 98-103: The query uses raw columns (embedding_vector /
embedding_vector_1) in the distance expression and ORDER BY, so PostgreSQL can't
use the HNSW expression indexes; update the distance calculation and ORDER BY to
use the exact casted expressions used by the indexes (e.g.
(embedding_vector::vector(512)) and (embedding_vector_1::vector(1024))) wherever
KnowledgeBaseEmbeddings.embedding_vector or embedding_vector_1 are referenced in
generate_query.py so the planner can utilize the HNSW indexes and avoid seq
scans.
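Taken together, the two comments above suggest a candidate CTE shaped roughly like the sketch below. This is illustrative only: the table and column names are assumed from the review, and the `/* metadata_filter_clause_inner */` comment stands in for the real filter clause.

```python
# Illustrative SQL only; names are assumed from the review comments.
# The CTE is scoped to the KB *before* its LIMIT, and every reference to
# the embedding column uses the exact casted expression the HNSW
# expression index was built on, so the planner can use the index.
SCOPED_HNSW_SQL = """
WITH hnsw_candidates AS (
    SELECT e.id,
           (e.embedding_vector::vector(512)) <=> :query_embed::vector(512)
               AS distance
    FROM knowledge_base_embeddings e
    WHERE e.document_id IN (
        SELECT d.id
        FROM knowledge_base_documents d
        WHERE d.knowledge_base_id = :kb_id
          /* metadata_filter_clause_inner */
    )
    ORDER BY (e.embedding_vector::vector(512)) <=> :query_embed::vector(512)
    LIMIT :limit * 10
),
vector_results AS (
    SELECT id, distance
    FROM hnsw_candidates
    ORDER BY distance
    LIMIT :limit
)
SELECT * FROM vector_results
"""
```

Note that the `ORDER BY` expression matches the indexed expression character for character; ordering by the raw `embedding_vector` column would not qualify for the expression index.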
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py`:
- Around line 47-59: The CLIP->DINO flow should not call image_retrieve_dino
when the CLIP stage produced no candidates; detect when clip_results is empty in
the block inside image_rag_retrieve (where clip_results is built) and
short-circuit (e.g., return an empty list/result or the appropriate no-match
response) instead of calling image_retrieve_dino, because image_retrieve_dino
(and get_image_embedding_dino) will drop the reference_ids filter when given an
empty reference_id_list and perform an unwanted full-KB DINO search.
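A minimal sketch of the suggested short-circuit. The function and result shapes here are hypothetical; the real `image_rag_retrieve` and `image_retrieve_dino` signatures are not shown in this review.

```python
from typing import Any, Callable, Dict, List


def dino_rerank(
    clip_results: List[Dict[str, Any]],
    image_retrieve_dino: Callable[..., List[Dict[str, Any]]],
    query_image: Any,
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: re-rank CLIP candidates with DINO."""
    if not clip_results:
        # No CLIP candidates: return early. Calling image_retrieve_dino
        # with an empty reference_id_list would drop the filter and fall
        # back to an unwanted full-KB DINO search.
        return []
    reference_ids = [r["id"] for r in clip_results]
    return image_retrieve_dino(query_image, reference_id_list=reference_ids)
```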
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5efc6c6-4532-43af-9115-3815f5e884fe
📒 Files selected for processing (3)
- wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
- wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
- wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (2)
Lines 252-272: ⚠️ Potential issue | 🟠 Major: Scale the DINO ANN window with the requested page.

`hnsw_candidates` only pulls `:limit * 10` rows, then the outer query applies `OFFSET :offset`. Once users page deeper, valid candidates are dropped before the final sort. `image_rag_retrieve.py:47-63` forwards `offset`/`limit` into this query, so this will surface as incomplete later pages.

💡 Suggested change
```diff
 effective_limit = limit if limit is not None else int(params.get('top_k', 10))
 reference_id_list: List[Any] = params.get('reference_id_list', [])
 effective_offset = offset if offset is not None else 0
+candidate_limit = (effective_offset + effective_limit) * 10
 ...
 params = {
     'query_embedding': query_embeddings,
     'kb_id': kb_id,
-    'top_k': effective_limit,
     'reference_ids': processed_reference_ids,
     'offset': effective_offset,
     'limit': effective_limit,
+    'candidate_limit': candidate_limit,
 }
 ...
-LIMIT :limit * 10
+LIMIT :candidate_limit
```

Also applies to: 290-322
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py` around lines 252 - 272, The DINO ANN candidate window is fixed to ":limit * 10" causing valid results to be dropped when paginating; update generate_query.py to scale the ANN retrieval window by page so hnsw_candidates fetches enough rows for the requested offset. Concretely, compute an ann_window (e.g., ann_window = (effective_offset + effective_limit) * 10 or ann_window = (math.ceil((effective_offset + 1)/effective_limit) * effective_limit * 10)) using the existing effective_offset and effective_limit variables, pass that ann_window into the query parameters used by hnsw_candidates (instead of plain limit*10), and ensure the params dict (where 'top_k'/'reference_ids' are set) supplies this scaled window so deeper pages retain enough candidates before the outer OFFSET is applied.
Lines 53-70: ⚠️ Potential issue | 🟠 Major: Don't page after trimming both rerank sources to one page.

`vector_results` and `keyword_results` are capped at `:limit` before the final `OFFSET :offset`, so page 2+ is computed from page 1 candidates only. The new ANN CTE has the same problem. At minimum, size those inner windows from `offset + limit` and oversample that value for `hnsw_candidates`.

💡 Suggested change
```diff
 effective_offset = offset or 0
+page_window = effective_offset + effective_limit
 # Prepare query parameters
 query_params = {
     'query_embed': str(query_embeddings[0]),
     'threshold': threshold,
     'kb_id': kb_id,
     'vector_weight': vector_weight,
     'keyword_weight': keyword_weight,
     'query': query,
     'offset': effective_offset,
     'limit': effective_limit,
+    'page_window': page_window,
+    'candidate_limit': page_window * 10,
 }
 ...
-LIMIT :limit * 10
+LIMIT :candidate_limit
 ...
-LIMIT :limit
+LIMIT :page_window
 ...
-LIMIT :limit
+LIMIT :page_window
```

Also applies to: 98-157
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py` around lines 53 - 70, The current query trims vector_results and keyword_results to limit before applying OFFSET, causing page N>1 to only consider page 1 candidates; update the logic in generate_query (references: vector_results, keyword_results, hnsw_candidates and the ANN CTE) to request/compute inner windows sized at least offset + limit (and preferably oversample, e.g. multiply by a small factor) so the subsequent OFFSET/:offset and final LIMIT/:limit operate on a superset of candidates; adjust any places building query_params or CTE limits to use this expanded window size instead of just effective_limit.
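Both paging comments boil down to the same window arithmetic. A small sketch, with variable names assumed from the suggested changes:

```python
def paging_windows(offset: int, limit: int, oversample: int = 10) -> tuple:
    """Size the inner CTE windows so OFFSET never starves a page.

    page_window covers everything up to and including the requested page;
    candidate_limit oversamples that window for the ANN candidate pool.
    """
    page_window = offset + limit
    candidate_limit = page_window * oversample
    return page_window, candidate_limit
```

For page 1 (`offset=0, limit=10`) this yields the original 10/100 windows; for page 3 (`offset=20, limit=10`) the inner windows grow to 30/300 so the final `OFFSET 20 LIMIT 10` still operates on a full candidate set.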
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7e83758b-d51a-49dd-b1de-1e44c90a7ce5
📒 Files selected for processing (2)
- wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
- wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py
🚧 Files skipped from review as they are similar to previous changes (1)
- wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py
Summary by CodeRabbit
- Performance Improvements
- Bug Fixes
- Improvements