-
Notifications
You must be signed in to change notification settings - Fork 660
🐛 Bugfix: re-embed chunk on content update #3250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1907,8 +1907,15 @@ def update_chunk( | |
| user_id: Optional[str] = None, | ||
| tenant_id: Optional[str] = None, | ||
| ): | ||
| """ | ||
| Update a chunk document. | ||
| """Update a chunk document. | ||
|
|
||
| When the payload changes ``content``, the embedding vector is | ||
| regenerated using the knowledge base's currently configured | ||
| embedding model and persisted alongside the text so that the | ||
| stored vector cannot drift out of sync with the stored text. | ||
| Metadata-only updates (no ``content`` in payload, or ``content=None``) | ||
| skip the embedding call. Any failure during embedding resolution or | ||
| generation aborts the entire update. | ||
| """ | ||
| try: | ||
| update_fields = chunk_request.dict( | ||
|
|
@@ -1928,6 +1935,35 @@ def update_chunk( | |
| if not update_payload: | ||
| raise ValueError("No update fields supplied.") | ||
|
|
||
| new_content = update_fields.get("content") | ||
| if "content" in update_fields and new_content is not None: | ||
| if not tenant_id: | ||
| raise ValueError( | ||
| "tenant_id is required to re-embed updated chunk content.") | ||
| knowledge_record = get_knowledge_record({ | ||
| "index_name": index_name, | ||
| "tenant_id": tenant_id, | ||
| }) | ||
| if not knowledge_record: | ||
| raise ValueError( | ||
| f"Knowledge base record for index '{index_name}' not found.") | ||
| embedding_model_id = knowledge_record.get("embedding_model_id") | ||
| if not embedding_model_id: | ||
| raise ValueError( | ||
| f"No embedding model configured for index '{index_name}'.") | ||
| embedding_model, _ = get_embedding_model_by_id( | ||
| tenant_id, embedding_model_id) | ||
| if embedding_model is None: | ||
| raise ValueError( | ||
| f"Failed to resolve embedding model {embedding_model_id} " | ||
| f"for index '{index_name}'.") | ||
| embeddings = embedding_model.get_embeddings(new_content) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [逻辑漏洞]
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| if not embeddings: | ||
| raise ValueError( | ||
| "Embedding generation returned no vector for updated content.") | ||
| update_payload["embedding"] = embeddings[0] | ||
| update_payload["embedding_model_id"] = embedding_model_id | ||
|
|
||
| result = vdb_core.update_chunk( | ||
| index_name, chunk_id, update_payload) | ||
| return { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
每次内容更新都会触发
get_knowledge_record→get_embedding_model_by_id→get_embeddings,批量更新时会产生大量重复的 DB 查询和模型实例化。建议按index_name缓存 embedding model 引用。There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not worth it.
update_chunkis not a hot loop. It's a single-chunk edit endpoint, so the two lookups (get_knowledge_record,get_embedding_model_by_id) run once per user edit — not per-chunk in a bulk path.That said, bulk ingest uses a different code path (
index_documentsinbackend/services/vectordatabase_service.py). It already takes embedding_model as a parameter resolved once by the caller — so per-document re-resolution never happens there.