-
Notifications
You must be signed in to change notification settings - Fork 71
fix: handle empty gremlin examples when building index #323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,7 +34,7 @@ def node_init(self): | |
| # pylint: disable=import-outside-toplevel | ||
| from hugegraph_llm.utils.vector_index_utils import get_vector_index_class | ||
|
|
||
| if not self.wk_input.examples: | ||
| if self.wk_input.examples is None: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The behavior change
Why this matters (interacts with the operator change)Combined with the new empty-examples branch in the operator, the effective path for
So an empty list, which the error message still claims is invalid ("examples is required in BuildGremlinExampleIndexNode"), in fact now acts as an implicit reset command. The validation message and the runtime behavior contradict each other. Suggested resolution — pick one
|
||
| return CStatus(-1, "examples is required in BuildGremlinExampleIndexNode") | ||
| examples = self.wk_input.examples | ||
| vector_index = get_vector_index_class(index_settings.cur_vector_index) | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -38,14 +38,24 @@ def __init__( | |||||||||||||||||||||||||||
| self.vector_index = vector_index | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| def run(self, context: Dict[str, Any]) -> Dict[str, Any]: | ||||||||||||||||||||||||||||
| # !: We have assumed that self.example is not empty | ||||||||||||||||||||||||||||
| if not self.examples: | ||||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||||||||||||||||||||||||||||
| self.vector_index.clean(self.vector_index_name) | ||||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
PR 描述只提到"修复 IndexError",没有说明这个行为变更。建议回归到纯早返回:
Suggested change
如果确实需要"空 examples 时清空旧索引"的语义,请:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why this is a problemIn this repository,
So the intent of What this PR actually doesPreviously, if a caller accidentally passed
A user who has already built a working index and then re-triggers the workflow with an empty example list will lose their index without any warning. The PR title says "handle empty gremlin examples when building index", which readers will reasonably interpret as a no-op guard, not a reset. Suggested fixDrop the
Suggested change
If the intent really is "empty input should reset the index", please:
|
||||||||||||||||||||||||||||
| context["embed_dim"] = 0 | ||||||||||||||||||||||||||||
| return context | ||||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the review. I updated the implementation to clean the existing gremlin_examples index before returning for the empty examples / empty embeddings cases, so stale examples will not be reused. I also aligned the node-level guard to reject only None and allow an empty examples list to reach the no-op path. Tests: |
||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| queries = [example["query"] for example in self.examples] | ||||||||||||||||||||||||||||
| # TODO: refactor function chain async to avoid blocking | ||||||||||||||||||||||||||||
| examples_embedding = asyncio.run(get_embeddings_parallel(self.embedding, queries)) | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| if not examples_embedding: | ||||||||||||||||||||||||||||
| self.vector_index.clean(self.vector_index_name) | ||||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 走到这个分支意味着:
建议改为告警或抛异常:
Suggested change
或直接
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When this branch is actually reachedControl only gets here when
In other words, this is an error signal, not an empty-input case. Why the current handling is wrongThe code currently:
From the caller's perspective, the workflow looks successful. But on disk, a working Suggested fixAt minimum, do not Option A — log loudly and skip (preserves existing index):
Suggested change
Option B — fail fast, which is what the previous if not examples_embedding:
raise RuntimeError(
f"Embedding service returned empty result for {len(self.examples)} examples"
)Either is acceptable; the key point is that a silent |
||||||||||||||||||||||||||||
| context["embed_dim"] = 0 | ||||||||||||||||||||||||||||
| return context | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| embed_dim = len(examples_embedding[0]) | ||||||||||||||||||||||||||||
| if len(self.examples) > 0: | ||||||||||||||||||||||||||||
| vector_index = self.vector_index.from_name(embed_dim, self.vector_index_name) | ||||||||||||||||||||||||||||
| vector_index.add(examples_embedding, self.examples) | ||||||||||||||||||||||||||||
| vector_index.save_index_by_name(self.vector_index_name) | ||||||||||||||||||||||||||||
| vector_index = self.vector_index.from_name(embed_dim, self.vector_index_name) | ||||||||||||||||||||||||||||
| vector_index.add(examples_embedding, self.examples) | ||||||||||||||||||||||||||||
| vector_index.save_index_by_name(self.vector_index_name) | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| context["embed_dim"] = embed_dim | ||||||||||||||||||||||||||||
| return context | ||||||||||||||||||||||||||||
| return context | ||||||||||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
违反 PEP 8 W292,某些 lint/pre-commit 钩子会失败。请在文件末尾补一个换行。
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The diff shows |
||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -54,7 +54,10 @@ def test_init(self): | |||||||||
| self.assertEqual(self.index_builder.vector_index_name, "gremlin_examples") | ||||||||||
|
|
||||||||||
| @patch('asyncio.run') | ||||||||||
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||||||||||
| @patch( | ||||||||||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Patch 目标的修正合理 由
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧹 Patch target fix is correct — nice catch. Rebinding the patch from The previous patch target looked plausible but never actually intercepted the call, so the test was implicitly relying on |
||||||||||
| new_callable=MagicMock, | ||||||||||
| ) | ||||||||||
| def test_run_with_examples(self, mock_get_embeddings_parallel, mock_asyncio_run): | ||||||||||
| """Test run method with examples""" | ||||||||||
| # Setup mocks | ||||||||||
|
|
@@ -78,7 +81,10 @@ def test_run_with_examples(self, mock_get_embeddings_parallel, mock_asyncio_run) | |||||||||
| self.assertEqual(context["embed_dim"], 3) | ||||||||||
|
|
||||||||||
| @patch('asyncio.run') | ||||||||||
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||||||||||
| @patch( | ||||||||||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', | ||||||||||
| new_callable=MagicMock, | ||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 当前测试名叫
Suggested change
另外建议补一个
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What this test currently assertsmock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
mock_vector_store_class.from_name.assert_not_called()
# ...This effectively turns "empty examples → delete the on-disk index" into a frozen contract enforced by CI. If the operator-side comments are accepted and
Suggested change
Missing coverageThe two early-return branches in
|
||||||||||
| ) | ||||||||||
| def test_run_with_empty_examples(self, mock_get_embeddings_parallel, mock_asyncio_run): | ||||||||||
| """Test run method with empty examples""" | ||||||||||
| # Create new mocks for this test | ||||||||||
|
|
@@ -98,12 +104,24 @@ def test_run_with_empty_examples(self, mock_get_embeddings_parallel, mock_asynci | |||||||||
| # Run the method | ||||||||||
| context = {} | ||||||||||
|
|
||||||||||
| # This should raise an IndexError when trying to access examples_embedding[0] | ||||||||||
| with self.assertRaises(IndexError): | ||||||||||
| empty_index_builder.run(context) | ||||||||||
| # Empty examples should be handled gracefully without building vector index. | ||||||||||
| result = empty_index_builder.run(context) | ||||||||||
|
|
||||||||||
| mock_asyncio_run.assert_not_called() | ||||||||||
| mock_get_embeddings_parallel.assert_not_called() | ||||||||||
| mock_vector_store_class.clean.assert_called_once_with("gremlin_examples") | ||||||||||
| mock_vector_store_class.from_name.assert_not_called() | ||||||||||
| mock_vector_store_instance.add.assert_not_called() | ||||||||||
| mock_vector_store_instance.save_index_by_name.assert_not_called() | ||||||||||
|
|
||||||||||
| self.assertEqual(result["embed_dim"], 0) | ||||||||||
| self.assertEqual(context["embed_dim"], 0) | ||||||||||
|
|
||||||||||
| @patch('asyncio.run') | ||||||||||
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||||||||||
| @patch( | ||||||||||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', | ||||||||||
| new_callable=MagicMock, | ||||||||||
| ) | ||||||||||
| def test_run_single_example(self, mock_get_embeddings_parallel, mock_asyncio_run): | ||||||||||
| """Test run method with single example""" | ||||||||||
| # Create new mocks for this test | ||||||||||
|
|
@@ -134,7 +152,10 @@ def test_run_single_example(self, mock_get_embeddings_parallel, mock_asyncio_run | |||||||||
| self.assertEqual(result["embed_dim"], 4) | ||||||||||
|
|
||||||||||
| @patch('asyncio.run') | ||||||||||
| @patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel') | ||||||||||
| @patch( | ||||||||||
| 'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel', | ||||||||||
| new_callable=MagicMock, | ||||||||||
| ) | ||||||||||
| def test_run_preserves_existing_context(self, mock_get_embeddings_parallel, mock_asyncio_run): | ||||||||||
| """Test that run method preserves existing context data""" | ||||||||||
| # Setup mocks | ||||||||||
|
|
||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not→is None放宽了校验,配合 operator 侧的 clean() 会放行空列表改动前:
if not self.wk_input.examples:对None和[]都返回错误,调用方拿不到 CStatus -1。改动后:只有
None报错,[]会继续执行,最终调用 operator 的 run() → 触发self.vector_index.clean("gremlin_examples")把现有索引清掉。两处改动叠加后的实际效果是:传入
examples=[]会静默销毁已有索引,但错误信息依然写着 "examples is required",语义自相矛盾。建议二选一:
None,并在 PR 描述中说明设计意图if not ...:校验