🐛 Bugfix: re-embed chunk on content update by Lingxi-Li · Pull Request #3250 · ModelEngine-Group/nexent

Lingxi-Li · 2026-06-17T04:02:49Z

🐛 Bugfix: re-embed chunk on content update so stored vector stays in sync

ElasticSearchService.update_chunk accepted content edits but wrote the new text to Elasticsearch without regenerating the embedding vector.
The stored vector silently drifted out of sync with the stored text, degrading pure k-NN search and corrupting the vector half of hybrid search with no operator-visible signal.
This change regenerates and persists the embedding whenever the update payload includes a non-None content value. It mirrors the resolution path that create_chunk already uses, but fails loudly where create_chunk silently skips (see the Follow-ups section).

New behavior

Metadata-only updates (no content key in payload, or content=None) take a fast path: no knowledge-record lookup, no embedding call, no embedding or embedding_model_id written.
Content updates re-resolve the embedding model from the KB's current record on every call, not from the existing chunk's stored embedding_model_id, to avoid re-using a possibly stale model id.
Any failure during model resolution or embedding generation aborts the entire update before vdb_core.update_chunk is invoked; the error is wrapped through the existing except block as Exception("Error updating chunk: ..."), matching the prior fail-loud convention.
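
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the two lookup helpers are passed in as callables so the block is self-contained, the knowledge record is assumed to be a dict, the argument order of get_embedding_model_by_id is assumed, and get_embeddings is assumed to return a list with one vector per input. build_update_payload and the core_update parameter are hypothetical names.

```python
from typing import Any, Callable, Dict, Optional


def build_update_payload(
    update_fields: Dict[str, Any],
    index_name: str,
    tenant_id: Optional[str],
    get_knowledge_record: Callable[[Dict[str, Any]], Optional[Dict[str, Any]]],
    get_embedding_model_by_id: Callable[..., Any],
) -> Dict[str, Any]:
    """Return the flat payload to send to the core update call."""
    payload = dict(update_fields)
    new_content = payload.get("content")

    # Fast path: no content key, or content=None, means no lookups and no embedding.
    if new_content is None:
        return payload

    if not tenant_id:
        raise ValueError("tenant_id is required to re-embed updated chunk content.")

    # Resolve from the KB's current record, not the chunk's stored model id.
    record = get_knowledge_record({"index_name": index_name, "tenant_id": tenant_id})
    if not record:
        raise ValueError(f"No knowledge record found for index '{index_name}'.")

    model_id = record.get("embedding_model_id")  # record assumed to be a dict
    if not model_id:
        raise ValueError(f"Index '{index_name}' has no embedding_model_id.")

    model = get_embedding_model_by_id(tenant_id, model_id)  # argument order assumed
    if model is None:
        raise ValueError(
            f"Failed to resolve embedding model {model_id} for index '{index_name}'.")

    embeddings = model.get_embeddings(new_content)
    if not embeddings:
        raise ValueError("Embedding model returned an empty result.")

    payload["embedding"] = embeddings[0]  # assumes one vector per input text
    payload["embedding_model_id"] = model_id
    return payload


def update_chunk(core_update: Callable[[Dict[str, Any]], Any], **kwargs: Any) -> Any:
    """Wrap resolution, embedding, and core failures in the same error message."""
    try:
        payload = build_update_payload(**kwargs)
        return core_update(payload)
    except Exception as exc:
        raise Exception(f"Error updating chunk: {exc}") from exc
```

Because the payload is built before core_update is called, any resolution or embedding failure aborts the update without touching the stored chunk.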

Changes

backend/services/vectordatabase_service.py: ElasticSearchService.update_chunk now detects caller-set content, resolves the embedding model from the knowledge base record via get_knowledge_record then get_embedding_model_by_id, calls get_embeddings, and writes embedding plus embedding_model_id onto the same flat payload sent to vdb_core.update_chunk.
backend/consts/model.py: ChunkUpdateRequest.content gains min_length=1 so an explicit empty string is rejected at the schema boundary, mirroring ChunkCreateRequest.content.
Tests: nine new service tests covering the metadata-only fast path, content re-embedding, the content=None skip, and six fail-loud branches (missing tenant, missing knowledge record, missing embedding_model_id, unresolvable model, get_embeddings raising, empty embedding result).
Existing test updates:
1. test_update_chunk_builds_payload_and_calls_core was rewritten to use the real ChunkUpdateRequest, mock the two new resolution helpers, pass tenant_id, and assert the new embedding / embedding_model_id fields.
2. test_update_chunk_wrapped_exception was switched to a metadata-only payload so the core-failure path is still exercised.
3. test/backend/test_model_consts.py adds the empty-string rejection plus a positive metadata-only construction assertion.
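
The schema-level rejection of an empty string might look like the sketch below. It assumes ChunkUpdateRequest is a Pydantic model, which the PR text implies but does not show; the class name ChunkUpdateRequestSketch and the title field are hypothetical and stand in for the real definition in backend/consts/model.py.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ChunkUpdateRequestSketch(BaseModel):
    # min_length=1 rejects an explicit "" while still allowing None or omission.
    content: Optional[str] = Field(default=None, min_length=1)
    # Hypothetical metadata field, present only to show a metadata-only update.
    title: Optional[str] = None


def is_valid(**fields) -> bool:
    """Return True when the fields pass schema validation."""
    try:
        ChunkUpdateRequestSketch(**fields)
        return True
    except ValidationError:
        return False
```

Under these assumptions, is_valid(content="") is False, while is_valid(title="x") and is_valid(content=None) are both True, so metadata-only requests still construct.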

Follow-ups

Make create_chunk fail-loud on embedding resolution failure, matching the convention this PR establishes for update_chunk.
Today it silently proceeds without an embedding field when tenant_id or embedding_model_id is missing or when get_embeddings raises, and only logs a warning.
Deferred because the asymmetry is load-bearing: a skipped create leaves an auditable missing field, while a stale update leaves a wrong field that looks healthy. Flipping create may also break callers that tolerate vector-less creates during KB setup churn, so it deserves its own caller-impact review.
Reconcile the inconsistency where the bulk ingest path writes the per-document field embedding_model_name (matching the index mapping) but create_chunk writes embedding_model_id, which lands as a dynamic field.
The mismatch originates in create_chunk and predates this PR; update_chunk mirrors create_chunk here, so this PR perpetuates the mismatch rather than introducing or fixing it.
Bound chunk content size at the schema boundary on both ChunkUpdateRequest and ChunkCreateRequest.
Today neither schema caps content length, so an oversize payload reaches the embedding provider directly. If the provider truncates server-side instead of erroring, a vector for only the prefix is stored against the full text. That reintroduces the same class of vector/text drift this PR addresses, sourced from oversize input rather than from skipped re-embedding.
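
One possible shape for such a bound is a simple guard that runs before the embedding call. This is only a sketch: the PR does not propose a limit, so MAX_CONTENT_CHARS and ensure_content_within_limit are hypothetical, and a character count is only a rough proxy for a model's token limit.

```python
MAX_CONTENT_CHARS = 8000  # hypothetical cap; the PR does not propose a value


def ensure_content_within_limit(content: str, max_chars: int = MAX_CONTENT_CHARS) -> str:
    """Return content unchanged, or raise before it can reach the embedding provider."""
    if len(content) > max_chars:
        raise ValueError(
            f"content is {len(content)} characters; the limit is {max_chars}.")
    return content
```

Raising instead of truncating keeps the stored text and the stored vector describing the same input.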

JasonW404 · 2026-06-24T04:02:57Z

+                        "tenant_id is required to re-embed updated chunk content.")
+                knowledge_record = get_knowledge_record({
+                    "index_name": index_name,
+                    "tenant_id": tenant_id,


Every content update triggers get_knowledge_record → get_embedding_model_by_id → get_embeddings, so bulk updates would produce many redundant DB queries and model instantiations. Suggest caching the embedding model reference by index_name.

Lingxi-Li · 2026-06-24

Probably not worth it. update_chunk is not a hot loop: it is a single-chunk edit endpoint, so the two lookups (get_knowledge_record, get_embedding_model_by_id) run once per user edit, not per chunk in a bulk path.

Bulk ingest uses a different code path (index_documents in backend/services/vectordatabase_service.py). It already takes embedding_model as a parameter resolved once by the caller, so per-document re-resolution never happens there.

YehongPan · 2026-06-24T05:14:13Z

+                    raise ValueError(
+                        f"Failed to resolve embedding model {embedding_model_id} "
+                        f"for index '{index_name}'.")
+                embeddings = embedding_model.get_embeddings(new_content)


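[Logic gap] embedding_model.get_embeddings(new_content) is called without validating the length of new_content. If the updated content exceeds the embedding model's maximum token limit, embedding may fail or the input may be truncated. Suggest adding a content length check before the call, or giving a clearer error message in the exception handling.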

Lingxi-Li · 2026-06-24

create_chunk has the same gap, so this is symmetric with the pre-existing path rather than a regression introduced here. It is out of scope for this PR, so I added it as a follow-up bullet.

WMC001 · 2026-06-24T07:18:12Z

LGTM. The re-embedding logic to backfill vectors for existing knowledge base entries is a sensible feature. No issues found.

🐛 Bugfix: re-embed chunk on content update

2187f22

Lingxi-Li requested review from Dallas98 and WMC001 as code owners June 17, 2026 04:02

JasonW404 reviewed Jun 24, 2026

YehongPan reviewed Jun 24, 2026

Lingxi-Li wants to merge 1 commit into ModelEngine-Group:develop from Lingxi-Li:dev

4 participants
