🐛 Bugfix: re-embed chunk on content update#3250
Conversation
| "tenant_id is required to re-embed updated chunk content.") | ||
| knowledge_record = get_knowledge_record({ | ||
| "index_name": index_name, | ||
| "tenant_id": tenant_id, |
There was a problem hiding this comment.
每次内容更新都会触发 get_knowledge_record → get_embedding_model_by_id → get_embeddings,批量更新时会产生大量重复的 DB 查询和模型实例化。建议按 index_name 缓存 embedding model 引用。
There was a problem hiding this comment.
Probably not worth it. update_chunk is not a hot loop. It's a single-chunk edit endpoint, so the two lookups (get_knowledge_record, get_embedding_model_by_id) run once per user edit — not per-chunk in a bulk path.
That said, bulk ingest uses a different code path (index_documents in backend/services/vectordatabase_service.py). It already takes embedding_model as a parameter resolved once by the caller — so per-document re-resolution never happens there.
| raise ValueError( | ||
| f"Failed to resolve embedding model {embedding_model_id} " | ||
| f"for index '{index_name}'.") | ||
| embeddings = embedding_model.get_embeddings(new_content) |
There was a problem hiding this comment.
[逻辑漏洞] embedding_model.get_embeddings(new_content) 调用时未对 new_content 的长度进行校验。如果更新的内容超过了 embedding 模型的最大 token 限制,可能导致 embedding 失败或截断。建议在调用前添加内容长度检查,或在异常处理中给出更明确的错误提示。
There was a problem hiding this comment.
create_chunk has the same gap, so this is symmetric with the pre-existing path rather than a regression introduced here. That said, it should be out of scope for this PR. Added as a follow-up bullet.
|
LGTM. The re-embedding logic to backfill vectors for existing knowledge base entries is a sensible feature. No issues found. |
🐛 Bugfix: re-embed chunk on content update so stored vector stays in sync
ElasticSearchService.update_chunkacceptedcontentedits but wrote the new text to Elasticsearch without regenerating the embedding vector.The stored vector silently drifted out of sync with the stored text, degrading pure k-NN search and corrupting the vector half of hybrid search with no operator-visible signal.
This change regenerates and persists the embedding whenever the update payload includes a non-None
contentvalue, mirroring the resolution path thatcreate_chunkalready uses but with fail-loud semantics instead of create's silent-skip (see Follow-ups section).New behavior
contentkey in payload, orcontent=None) take a fast path: no knowledge-record lookup, no embedding call, noembeddingorembedding_model_idwritten.embedding_model_id, to avoid re-using a possibly stale model id.vdb_core.update_chunkis invoked; the error is wrapped through the existingexceptblock asException("Error updating chunk: ..."), matching the prior fail-loud convention.Changes
backend/services/vectordatabase_service.py:ElasticSearchService.update_chunknow detects caller-setcontent, resolves the embedding model from the knowledge base record viaget_knowledge_recordthenget_embedding_model_by_id, callsget_embeddings, and writesembeddingplusembedding_model_idonto the same flat payload sent tovdb_core.update_chunk.backend/consts/model.py:ChunkUpdateRequest.contentgainsmin_length=1so an explicit empty string is rejected at the schema boundary, mirroringChunkCreateRequest.content.content=Noneskip, and six fail-loud branches (missing tenant, missing knowledge record, missingembedding_model_id, unresolvable model,get_embeddingsraising, empty embedding result).test_update_chunk_builds_payload_and_calls_corewas rewritten to use the realChunkUpdateRequest, mock the two new resolution helpers, passtenant_id, and assert the newembedding/embedding_model_idfields.test_update_chunk_wrapped_exceptionwas switched to a metadata-only payload so the core-failure path is still exercised.test/backend/test_model_consts.pyadds the empty-string rejection plus a positive metadata-only construction assertion.Follow-ups
create_chunkfail-loud on embedding resolution failure, matching the convention this PR establishes forupdate_chunk.Today it silently proceeds without an
embeddingfield whentenant_idorembedding_model_idis missing or whenget_embeddingsraises, and only logs a warning.Deferred because the asymmetry is load-bearing — create-skip leaves an auditable missing field while update-stale leaves a wrong field that looks healthy — and flipping create may break callers that tolerate vector-less creates during KB setup churn, so it deserves its own caller-impact review.
embedding_model_name(matching the index mapping) butcreate_chunkwritesembedding_model_id, which lands as a dynamic field.The mismatch originates in
create_chunkand predates this PR;update_chunkmirrorscreate_chunkhere, so this PR perpetuates the mismatch rather than introducing or fixing it.ChunkUpdateRequestandChunkCreateRequest.Today neither schema caps
contentlength, so an oversize payload reaches the embedding provider directly; if the provider truncates server-side instead of erroring, a vector for only the prefix is stored against the full text, reintroducing the same vector/text drift class this PR addresses but sourced from oversize input instead of skipped re-embedding.