Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ def node_init(self):
# pylint: disable=import-outside-toplevel
from hugegraph_llm.utils.vector_index_utils import get_vector_index_class

if not self.wk_input.examples:
if self.wk_input.examples is None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ notis None 放宽了校验,配合 operator 侧的 clean() 会放行空列表

改动前:if not self.wk_input.examples:None[] 都返回错误,调用方拿不到 CStatus -1。

改动后:只有 None 报错,[] 会继续执行,最终调用 operator 的 run() → 触发 self.vector_index.clean("gremlin_examples") 把现有索引清掉。

两处改动叠加后的实际效果是:传入 examples=[] 会静默销毁已有索引,但错误信息依然写着 "examples is required",语义自相矛盾。

建议二选一:

  • 若允许空列表走 operator 的空分支:把错误信息改为只针对 None,并在 PR 描述中说明设计意图
  • 若不允许空列表:保留原来的 if not ...: 校验

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ notis None relaxes the check and contradicts the error message.

The behavior change

  • Before: if not self.wk_input.examples: — both None and [] are rejected, caller gets CStatus(-1, "examples is required...").
  • After: if self.wk_input.examples is None: — only None is rejected; [] is accepted and flows through to the operator.

Why this matters (interacts with the operator change)

Combined with the new empty-examples branch in the operator, the effective path for wk_input.examples = [] is now:

  1. node_init accepts the empty list (no longer returns CStatus(-1)).
  2. The operator's run() enters its new early-return branch.
  3. That branch calls self.vector_index.clean("gremlin_examples")silently wiping the user's existing index.

So an empty list, which the error message still claims is invalid ("examples is required in BuildGremlinExampleIndexNode"), in fact now acts as an implicit reset command. The validation message and the runtime behavior contradict each other.

Suggested resolution — pick one

  • Keep the stricter check (recommended unless there is an explicit reason to accept []). Revert this line to if not self.wk_input.examples: and, per the operator-side comments, remove the clean() calls.
  • Intentionally allow empty list as a reset command. Then update the error message to something like "examples cannot be None", document this behavior in the PR description, and keep (some form of) the operator-side reset logic — but make the reset explicit and logged, not silent.

return CStatus(-1, "examples is required in BuildGremlinExampleIndexNode")
examples = self.wk_input.examples
vector_index = get_vector_index_class(index_settings.cur_vector_index)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,24 @@ def __init__(
self.vector_index = vector_index

def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
# !: We have assumed that self.example is not empty
if not self.examples:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ BuildGremlinExampleIndexNode.node_init() still rejects [] via if not self.wk_input.examples (hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_gremlin_example_index.py:37), so the main BuildExampleIndexFlow never reaches this branch for the empty-list case. As written, the unit test passes but the end-to-end workflow still returns examples is required. Could we align the node-level guard in the same PR or add flow-level coverage?

self.vector_index.clean(self.vector_index_name)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ 不应在空 examples 时调用 clean(),这是破坏性副作用

vector_index.clean("gremlin_examples") 在本仓库其他地方(如 utils/graph_index_utils.py:50)是用作销毁整个索引文件的语义。当前实现会导致:调用者传入 examples=[] 本意是"跳过构建",但实际会静默删除已经建好的索引,造成数据丢失。

PR 描述只提到"修复 IndexError",没有说明这个行为变更。建议回归到纯早返回:

Suggested change
self.vector_index.clean(self.vector_index_name)
if not self.examples:
context["embed_dim"] = 0
return context

如果确实需要"空 examples 时清空旧索引"的语义,请:

  1. 在 PR 描述中显式说明这是设计意图
  2. 确认上游调用方(operator_list.py 等)不会意外传空列表

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Calling clean() on empty examples is a destructive side effect that the PR description does not mention.

Why this is a problem

In this repository, vector_index.clean(name) is the API used to destroy an existing index on disk. Look at how it is used elsewhere:

  • utils/graph_index_utils.py:49-50 — used to wipe graph_vids and gremlin_examples when the user explicitly asks for a reset.
  • utils/vector_index_utils.py:77 — used to wipe the chunks index before a rebuild.

So the intent of clean() is always "tear down", never "skip this step".

What this PR actually does

Previously, if a caller accidentally passed examples=[], run() would raise IndexError — noisy, but non-destructive. With this change, the same call path will now:

  1. Return embed_dim=0 silently, and
  2. Delete the existing gremlin_examples index on disk.

A user who has already built a working index and then re-triggers the workflow with an empty example list will lose their index without any warning. The PR title says "handle empty gremlin examples when building index", which readers will reasonably interpret as a no-op guard, not a reset.

Suggested fix

Drop the clean() call; a pure early return preserves existing state and still fixes the IndexError:

Suggested change
self.vector_index.clean(self.vector_index_name)
if not self.examples:
context["embed_dim"] = 0
return context

If the intent really is "empty input should reset the index", please:

  1. State this explicitly in the PR description so reviewers and downstream users are aware, and
  2. Confirm no upstream caller (e.g. operators/operator_list.py:69) can reach this path with an unintended empty list.

context["embed_dim"] = 0
return context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ Returning early here leaves any previously persisted gremlin_examples index untouched. GremlinExampleIndexQuery reuses that index whenever exist("gremlin_examples") is true, so rebuilding with an empty example set will still serve stale examples from the previous run. If an empty list is meant to clear the few-shot examples, we should wipe the index before returning.

Suggested change
return context
if not self.examples:
self.vector_index.clean(self.vector_index_name)
context["embed_dim"] = 0
return context

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I updated the implementation to clean the existing gremlin_examples index before returning for the empty examples / empty embeddings cases, so stale examples will not be reused.

I also aligned the node-level guard to reject only None and allow an empty examples list to reach the no-op path.

Tests:
uv run pytest src/tests/operators/index_op/test_build_gremlin_example_index.py -v
uv run pytest src/tests/operators/index_op/ -v


queries = [example["query"] for example in self.examples]
# TODO: refactor function chain async to avoid blocking
examples_embedding = asyncio.run(get_embeddings_parallel(self.embedding, queries))

if not examples_embedding:
self.vector_index.clean(self.vector_index_name)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ 空 embeddings 情况下 clean() 更危险,且掩盖了真实错误

走到这个分支意味着:self.examples 非空,但 embedding 服务返回了空列表 —— 这通常是 embedding 调用失败(网络、API key、配额等),而不是正常业务场景。此时静默 clean() + 返回 embed_dim=0 会:

  1. 清掉用户之前建好的正常索引
  2. 让调用方以为"成功构建了 0 维索引",掩盖真实故障

建议改为告警或抛异常:

Suggested change
self.vector_index.clean(self.vector_index_name)
if not examples_embedding:
log.error("Embedding returned empty for %d examples, skip index build", len(self.examples))
context["embed_dim"] = 0
return context

或直接 raise RuntimeError(...) 让上层感知失败。至少不要 clean()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ This branch silently hides embedding failures and destroys the existing index.

When this branch is actually reached

Control only gets here when self.examples is non-empty but get_embeddings_parallel(...) returns []. For a non-empty query list, an empty embedding result is almost never a legitimate business outcome — it typically indicates:

  • The embedding provider is unreachable (network / DNS / timeout),
  • Authentication / quota failure (bad API key, rate limit, exhausted credit),
  • A bug in the embedding client returning an empty list instead of raising.

In other words, this is an error signal, not an empty-input case.

Why the current handling is wrong

The code currently:

  1. Silently calls clean(), deleting any previously built index, and
  2. Returns embed_dim=0 as if the operation succeeded.

From the caller's perspective, the workflow looks successful. But on disk, a working gremlin_examples index has just been wiped, and the user has no way to tell something went wrong.

Suggested fix

At minimum, do not clean(). Prefer one of the two options below:

Option A — log loudly and skip (preserves existing index):

Suggested change
self.vector_index.clean(self.vector_index_name)
if not examples_embedding:
log.error(
"Embedding service returned empty result for %d examples; skipping index build",
len(self.examples),
)
context["embed_dim"] = 0
return context

Option B — fail fast, which is what the previous IndexError was accidentally doing:

        if not examples_embedding:
            raise RuntimeError(
                f"Embedding service returned empty result for {len(self.examples)} examples"
            )

Either is acceptable; the key point is that a silent clean() here turns a recoverable failure into irrecoverable data loss.

context["embed_dim"] = 0
return context

embed_dim = len(examples_embedding[0])
if len(self.examples) > 0:
vector_index = self.vector_index.from_name(embed_dim, self.vector_index_name)
vector_index.add(examples_embedding, self.examples)
vector_index.save_index_by_name(self.vector_index_name)
vector_index = self.vector_index.from_name(embed_dim, self.vector_index_name)
vector_index.add(examples_embedding, self.examples)
vector_index.save_index_by_name(self.vector_index_name)

context["embed_dim"] = embed_dim
return context
return context
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 文件末尾缺少换行符(diff 显示 \ No newline at end of file

违反 PEP 8 W292,某些 lint/pre-commit 钩子会失败。请在文件末尾补一个换行。

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Missing newline at end of file.

The diff shows \ No newline at end of file on the final return context line. This violates PEP 8 (rule W292) and will likely trip the project's pre-commit / lint hooks. Please add a trailing newline.

Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,10 @@ def test_init(self):
self.assertEqual(self.index_builder.vector_index_name, "gremlin_examples")

@patch('asyncio.run')
@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Patch 目标的修正合理

hugegraph_llm.utils.embedding_utils.get_embeddings_parallel 改为 hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel 是正确的 —— Python patch 应该打在"被 import 到的位置"而不是"定义位置",原来的 patch 其实没生效。new_callable=MagicMock 配合 asyncio.run 的 mock 也合理,避免协程警告。这部分无需修改。

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Patch target fix is correct — nice catch.

Rebinding the patch from hugegraph_llm.utils.embedding_utils.get_embeddings_parallel to hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel is the right move. The rule of thumb for unittest.mock.patch is "patch where the name is looked up, not where it is defined" — since the operator does from hugegraph_llm.utils.embedding_utils import get_embeddings_parallel, the function is bound to the operator module's namespace at import time, and that's the binding the test must replace.

The previous patch target looked plausible but never actually intercepted the call, so the test was implicitly relying on asyncio.run being mocked to avoid real work. Using new_callable=MagicMock here is also correct — it prevents the mock from being treated as an awaitable and suppresses the "coroutine was never awaited" warning. No change needed to this line.

new_callable=MagicMock,
)
def test_run_with_examples(self, mock_get_embeddings_parallel, mock_asyncio_run):
"""Test run method with examples"""
# Setup mocks
Expand All @@ -78,7 +81,10 @@ def test_run_with_examples(self, mock_get_embeddings_parallel, mock_asyncio_run)
self.assertEqual(context["embed_dim"], 3)

@patch('asyncio.run')
@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
new_callable=MagicMock,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 测试名与实际验证不一致

当前测试名叫 test_run_with_empty_examples,但 assertions 里同时验证了 clean 被调用、from_name/add/save 都不被调用 —— 等于把 "空 examples 会静默删除现有索引" 这个行为写死成了契约。如果采纳前面 operator 侧的修改建议(移除 clean 调用),这里需要同步:

Suggested change
new_callable=MagicMock,
mock_vector_store_class.clean.assert_not_called()

另外建议补一个 test_run_with_non_empty_examples_but_empty_embeddings 用例,单独覆盖 embeddings 为空(非 examples 为空)的分支。

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This test locks in the destructive behavior as a contract; please reconsider after the operator changes.

What this test currently asserts

mock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
mock_vector_store_class.from_name.assert_not_called()
# ...

This effectively turns "empty examples → delete the on-disk index" into a frozen contract enforced by CI. If the operator-side comments are accepted and clean() is removed, this assertion must flip:

Suggested change
new_callable=MagicMock,
mock_vector_store_class.clean.assert_not_called()

Missing coverage

The two early-return branches in run() (empty examples vs. empty examples_embedding) are semantically distinct — the second one implies an embedding-service failure, not an empty-input case. Please add a dedicated test, e.g. test_run_with_non_empty_examples_but_empty_embeddings, that:

  • Constructs the builder with non-empty examples,
  • Makes asyncio.run(...) return [],
  • Asserts the chosen error/logging behavior (raise, or log + skip),
  • Asserts clean() is not called.

)
def test_run_with_empty_examples(self, mock_get_embeddings_parallel, mock_asyncio_run):
"""Test run method with empty examples"""
# Create new mocks for this test
Expand All @@ -98,12 +104,24 @@ def test_run_with_empty_examples(self, mock_get_embeddings_parallel, mock_asynci
# Run the method
context = {}

# This should raise an IndexError when trying to access examples_embedding[0]
with self.assertRaises(IndexError):
empty_index_builder.run(context)
# Empty examples should be handled gracefully without building vector index.
result = empty_index_builder.run(context)

mock_asyncio_run.assert_not_called()
mock_get_embeddings_parallel.assert_not_called()
mock_vector_store_class.clean.assert_called_once_with("gremlin_examples")
mock_vector_store_class.from_name.assert_not_called()
mock_vector_store_instance.add.assert_not_called()
mock_vector_store_instance.save_index_by_name.assert_not_called()

self.assertEqual(result["embed_dim"], 0)
self.assertEqual(context["embed_dim"], 0)

@patch('asyncio.run')
@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
new_callable=MagicMock,
)
def test_run_single_example(self, mock_get_embeddings_parallel, mock_asyncio_run):
"""Test run method with single example"""
# Create new mocks for this test
Expand Down Expand Up @@ -134,7 +152,10 @@ def test_run_single_example(self, mock_get_embeddings_parallel, mock_asyncio_run
self.assertEqual(result["embed_dim"], 4)

@patch('asyncio.run')
@patch('hugegraph_llm.utils.embedding_utils.get_embeddings_parallel')
@patch(
'hugegraph_llm.operators.index_op.build_gremlin_example_index.get_embeddings_parallel',
new_callable=MagicMock,
)
def test_run_preserves_existing_context(self, mock_get_embeddings_parallel, mock_asyncio_run):
"""Test that run method preserves existing context data"""
# Setup mocks
Expand Down
Loading