
fix for indexing issue with knowledge bases#274

Open
vizsatiz wants to merge 4 commits into develop from embedding-index-issue

Conversation


@vizsatiz vizsatiz commented Apr 10, 2026

Summary by CodeRabbit

  • Performance Improvements

    • Added new vector and token indexes and refactored search queries to a two-stage approximate nearest-neighbor flow, significantly speeding up vector-based search and retrieval.
  • Bug Fixes

    • Tightened token update logic to only update missing tokens, preventing unintended overwrites.
  • Improvements

    • More reliable image retrieval: embedding extraction now validates presence of multiple embeddings and uses a two-stage retrieval to improve relevance and reduce empty results.


coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6770c15a-d0e9-4799-8006-75a1f8b088f3

📥 Commits

Reviewing files that changed from the base of the PR and between da2062b and 988f23f.

📒 Files selected for processing (1)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py

📝 Walkthrough

Adds an Alembic migration creating/dropping three HNSW vector indexes and one GIN index; refactors vector search into HNSW-candidate CTEs with explicit vector casts and a limited candidate set; tightens token updates so only NULL tokens are written; updates image retrieval to require both CLIP and DINO embeddings and to run the two retrievals sequentially.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Database Migration**<br>`wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py` | New Alembic migration that creates three HNSW indexes (on `embedding_vector::vector(512)` for cosine/L2 and `embedding_vector_1::vector(1024)` for cosine) with `WITH (m=16, ef_construction=64)` and a GIN index on `token`; `downgrade()` drops them using `DROP INDEX CONCURRENTLY IF EXISTS`. |
| **Query Generation / Vector Search**<br>`wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py` | Refactored vector search to a two-stage flow: an `hnsw_candidates` CTE performs approximate neighbor selection (explicit `::vector(512/1024)` casts, metadata filter for the candidate stage, candidate limiting); the final result joins candidates to documents and computes `vector_score = 1 - distance`. `get_update_tokens_query` now updates only rows where `token IS NULL`. |
| **Image Retrieval Refactor**<br>`wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py` | Removed persistent `self.reranked_image`; extraction of `clip_embedding` and `dino_embedding` is conditional; retrieval proceeds only when both embeddings are present: run CLIP retrieval, build the reference ID list, then run DINO retrieval constrained to those IDs; return `[]` on missing embeddings or empty intermediate results. |
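The two-stage flow described for generate_query.py can be sketched as a single SQL statement. This is an illustrative sketch only: the table and column names (`knowledge_base_embeddings`, `knowledge_base_documents`) are assumptions, not taken from the diff, and the knowledge-base scoping is shown inside the candidate stage so the ANN limit applies after filtering.

```python
# Hedged sketch of the two-stage ANN query. Stage 1 (hnsw_candidates)
# does approximate search with the same cast the HNSW index uses and is
# already scoped to the knowledge base; stage 2 converts distance to a
# similarity score and applies the final limit.
TWO_STAGE_QUERY = """
WITH hnsw_candidates AS (
    SELECT e.id,
           e.document_id,
           (e.embedding_vector::vector(512)) <=> (:query_embedding)::vector(512) AS distance
    FROM knowledge_base_embeddings e
    WHERE e.document_id IN (
        SELECT d.id
        FROM knowledge_base_documents d
        WHERE d.knowledge_base_id = :kb_id
    )
    ORDER BY (e.embedding_vector::vector(512)) <=> (:query_embedding)::vector(512)
    LIMIT :limit * 10
)
SELECT c.document_id,
       1 - c.distance AS vector_score
FROM hnsw_candidates c
ORDER BY vector_score DESC
LIMIT :limit
"""
```

Note that the `ORDER BY` in the candidate stage repeats the exact casted expression used by the index definition; if the raw column were used instead, the planner could not match the expression index.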

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Service as ImageRAGService
    participant QueryGen as QueryGenerator
    participant DB as Postgres(HNSW)
    participant DocStore as DocumentStore

    Client->>Service: send image retrieval request
    Service->>Service: extract clip_embedding and dino_embedding
    alt both embeddings present
        Service->>QueryGen: build hnsw_candidates CTE (clip ::vector(512))
        QueryGen->>DB: run approximate neighbor query (limit)
        DB-->>QueryGen: candidate ids + distances
        QueryGen->>DocStore: join candidates -> vector_results (compute vector_score)
        DocStore-->>Service: clip retrieval results (reference IDs)
        Service->>QueryGen: build constrained hnsw_candidates (dino ::vector(1024), reference IDs)
        QueryGen->>DB: run final scoring query
        DB-->>Service: dino-ranked results
        Service-->>Client: combined results
    else missing embeddings
        Service-->>Client: []
    end
```
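The sequence above can be sketched in plain Python. The function and field names here are illustrative stand-ins for the module's real API; the retriever callables abstract over the CLIP and DINO query paths, and the empty-CLIP short-circuit prevents an unfiltered full-KB DINO search.

```python
from typing import Any, Callable, Dict, List

# Illustrative sketch only: retriever callables stand in for the real
# CLIP/DINO retrieval implementations.
def image_rag_retrieve(
    clip_embedding: Any,
    dino_embedding: Any,
    clip_retrieve: Callable[[Any], List[Dict]],
    dino_retrieve: Callable[[Any, List[Any]], List[Dict]],
) -> List[Dict]:
    # Both embeddings must be present for the two-stage flow.
    if clip_embedding is None or dino_embedding is None:
        return []
    clip_results = clip_retrieve(clip_embedding)
    # Without candidates from CLIP, a DINO pass would search the whole
    # knowledge base, so return empty instead.
    if not clip_results:
        return []
    reference_ids = [r["id"] for r in clip_results]
    # DINO retrieval is constrained to the CLIP candidate IDs.
    return dino_retrieve(dino_embedding, reference_ids)
```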

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through vectors, soft and bright,
I build small CTE tunnels in the night,
Clip finds friends, Dino refines the line,
Indexes hum, neighbors align,
I nibble results — a tasty find. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix for indexing issue with knowledge bases' directly describes the main changes: adding HNSW indexes in a migration and refactoring vector search queries to use efficient approximate distance computation with HNSW candidates. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

```python
# revision identifiers, used by Alembic.
revision: str = 'e8f2a1c3b5d9'
down_revision: Union[str, None] = 'c7a9e2f4b1d0'
branch_labels: Union[str, Sequence[str], None] = None
```

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py`:
- Around line 21-69: The migration uses CREATE/DROP INDEX CONCURRENTLY inside
upgrade() and downgrade() via op.execute, which fails because CONCURRENTLY
cannot run inside a transaction; wrap each op.execute that creates or drops an
index in both upgrade() and downgrade() with an autocommit block using
op.get_context().autocommit_block() so each CREATE INDEX CONCURRENTLY / DROP
INDEX CONCURRENTLY runs outside the surrounding transaction (i.e., replace
direct op.execute(...) calls for ix_kbe_embedding_vector_hnsw_cosine,
ix_kbe_embedding_vector_hnsw_l2, ix_kbe_embedding_vector_1_hnsw_cosine, and
ix_kbe_token_gin with the same SQL executed inside with
op.get_context().autocommit_block():).
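The suggested fix can be sketched as follows; the table name is an assumption based on the index names quoted above, and the remaining three index statements would be handled the same way inside the autocommit block.

```python
# Sketch of upgrade() after the fix. Table name is assumed; the other
# three indexes follow the same pattern.
HNSW_COSINE_SQL = (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
    "ix_kbe_embedding_vector_hnsw_cosine "
    "ON knowledge_base_embeddings USING hnsw "
    "((embedding_vector::vector(512)) vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 64)"
)

def upgrade() -> None:
    from alembic import op
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    # so the statement executes inside Alembic's autocommit context.
    with op.get_context().autocommit_block():
        op.execute(HNSW_COSINE_SQL)
```

downgrade() would wrap each DROP INDEX CONCURRENTLY IF EXISTS statement in the same autocommit block.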

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`:
- Around line 92-123: The ANN candidate CTE hnsw_candidates is built across the
entire KnowledgeBaseEmbeddings table before applying the KB and metadata filters
in vector_results, so its LIMIT (:limit * 10) can drop all matches for the
requested KB; modify the query so the KB and metadata filters are applied inside
hnsw_candidates (e.g., join hnsw_candidates to KnowledgeBaseDocuments or
restrict by document_id IN (SELECT id FROM KnowledgeBaseDocuments WHERE
knowledge_base_id = :kb_id AND <metadata_filter_clause_inner>)) so the pre-LIMIT
candidate set is already scoped to the requested KB and
metadata_filter_clause_inner, then keep the subsequent vector_results CTE and
its final LIMIT :limit unchanged.
- Around line 98-103: The query uses raw columns (embedding_vector /
embedding_vector_1) in the distance expression and ORDER BY, so PostgreSQL can't
use the HNSW expression indexes; update the distance calculation and ORDER BY to
use the exact casted expressions used by the indexes (e.g.
(embedding_vector::vector(512)) and (embedding_vector_1::vector(1024))) wherever
KnowledgeBaseEmbeddings.embedding_vector or embedding_vector_1 are referenced in
generate_query.py so the planner can utilize the HNSW indexes and avoid seq
scans.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py`:
- Around line 47-59: The CLIP->DINO flow should not call image_retrieve_dino
when the CLIP stage produced no candidates; detect when clip_results is empty in
the block inside image_rag_retrieve (where clip_results is built) and
short-circuit (e.g., return an empty list/result or the appropriate no-match
response) instead of calling image_retrieve_dino, because image_retrieve_dino
(and get_image_embedding_dino) will drop the reference_ids filter when given an
empty reference_id_list and perform an unwanted full-KB DINO search.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5efc6c6-4532-43af-9115-3815f5e884fe

📥 Commits

Reviewing files that changed from the base of the PR and between a3a7752 and e5dbd45.

📒 Files selected for processing (3)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2026_04_10_1000-e8f2a1c3b5d9_add_hnsw_index_on_embeddings.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py

@vizsatiz vizsatiz changed the title from "fix for migration" to "fix for indexing issue with knowledge bases" on Apr 10, 2026

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py (2)

252-272: ⚠️ Potential issue | 🟠 Major

Scale the DINO ANN window with the requested page.

hnsw_candidates only pulls :limit * 10 rows, then the outer query applies OFFSET :offset. Once users page deeper, valid candidates are dropped before the final sort. image_rag_retrieve.py:47-63 forwards offset/limit into this query, so this will surface as incomplete later pages.

💡 Suggested change

```diff
         effective_limit = limit if limit is not None else int(params.get('top_k', 10))
         reference_id_list: List[Any] = params.get('reference_id_list', [])
         effective_offset = offset if offset is not None else 0
+        candidate_limit = (effective_offset + effective_limit) * 10
 ...
         params = {
             'query_embedding': query_embeddings,
             'kb_id': kb_id,
-            'top_k': effective_limit,
             'reference_ids': processed_reference_ids,
             'offset': effective_offset,
             'limit': effective_limit,
+            'candidate_limit': candidate_limit,
         }
 ...
-            LIMIT :limit * 10
+            LIMIT :candidate_limit
```

Also applies to: 290-322

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 252 - 272, The DINO ANN candidate window is fixed to ":limit * 10"
causing valid results to be dropped when paginating; update generate_query.py to
scale the ANN retrieval window by page so hnsw_candidates fetches enough rows
for the requested offset. Concretely, compute an ann_window (e.g., ann_window =
(effective_offset + effective_limit) * 10 or ann_window =
(math.ceil((effective_offset + 1)/effective_limit) * effective_limit * 10))
using the existing effective_offset and effective_limit variables, pass that
ann_window into the query parameters used by hnsw_candidates (instead of plain
limit*10), and ensure the params dict (where 'top_k'/'reference_ids' are set)
supplies this scaled window so deeper pages retain enough candidates before the
outer OFFSET is applied.
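The window arithmetic this comment proposes is easy to state on its own. The function name is illustrative, and the factor of 10 matches the oversampling already used by the query.

```python
def ann_candidate_window(effective_offset: int, effective_limit: int) -> int:
    # Cover every row up to the end of the requested page, then
    # oversample by 10x so the final rerank has enough candidates.
    return (effective_offset + effective_limit) * 10
```

With offset 0 and limit 10 this keeps the current 100-candidate window; at offset 20 it grows to 300, so deeper pages are no longer starved of candidates before the outer OFFSET is applied.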

53-70: ⚠️ Potential issue | 🟠 Major

Don't page after trimming both rerank sources to one page.

vector_results and keyword_results are capped at :limit before the final OFFSET :offset, so page 2+ is computed from page 1 candidates only. The new ANN CTE has the same problem. At minimum, size those inner windows from offset + limit and oversample that value for hnsw_candidates.

💡 Suggested change

```diff
         effective_offset = offset or 0
+        page_window = effective_offset + effective_limit

         # Prepare query parameters
         query_params = {
             'query_embed': str(query_embeddings[0]),
             'threshold': threshold,
             'kb_id': kb_id,
             'vector_weight': vector_weight,
             'keyword_weight': keyword_weight,
             'query': query,
             'offset': effective_offset,
             'limit': effective_limit,
+            'page_window': page_window,
+            'candidate_limit': page_window * 10,
         }
 ...
-                LIMIT :limit * 10
+                LIMIT :candidate_limit
 ...
-                LIMIT :limit
+                LIMIT :page_window
 ...
-                LIMIT :limit
+                LIMIT :page_window
```

Also applies to: 98-157

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py`
around lines 53 - 70, The current query trims vector_results and keyword_results
to limit before applying OFFSET, causing page N>1 to only consider page 1
candidates; update the logic in generate_query (references: vector_results,
keyword_results, hnsw_candidates and the ANN CTE) to request/compute inner
windows sized at least offset + limit (and preferably oversample, e.g. multiply
by a small factor) so the subsequent OFFSET/:offset and final LIMIT/:limit
operate on a superset of candidates; adjust any places building query_params or
CTE limits to use this expanded window size instead of just effective_limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e83758b-d51a-49dd-b1de-1e44c90a7ce5

📥 Commits

Reviewing files that changed from the base of the PR and between 82ec696 and da2062b.

📒 Files selected for processing (2)
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/queries/generate_query.py
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • wavefront/server/modules/knowledge_base_module/knowledge_base_module/services/image_rag_retrieve.py



```python
# revision identifiers, used by Alembic.
revision: str = 'e8f2a1c3b5d9'
down_revision: Union[str, None] = 'c7a9e2f4b1d0'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
```
