Add custom operator scanning and improve CLI dataset handling by cmgzn · Pull Request #37 · datajuicer/data-juicer-agents

cmgzn · 2026-04-10T07:19:46Z

Introduce custom operator scanning capabilities and enhance the command-line interface for dataset handling. The changes streamline the integration of custom operators and enforce stricter validation for dataset sources. Additionally, the handling of custom operator paths has been centralized within the process specification.
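One way the stricter dataset-source validation could look is an `argparse` mutually exclusive group, so that exactly one source must be given. This is a minimal sketch, not the PR's code: the flag names `--dataset-path` and `--dataset-config` and the `example-cli` program name are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example-cli")
    # Hypothetical flag names; the PR's real dataset-source options are not shown here.
    # required=True means exactly one of the grouped options must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--dataset-path", help="local file or directory")
    source.add_argument("--dataset-config", help="dataset config file")
    return parser


def parse_source(argv: list[str]) -> tuple[str, str]:
    """Return (kind, value) for whichever dataset source was given."""
    args = build_parser().parse_args(argv)
    if args.dataset_path is not None:
        return ("path", args.dataset_path)
    return ("config", args.dataset_config)
```

With this layout, passing both flags (or neither) makes `argparse` exit with a usage error before any dataset loading starts.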

gemini-code-assist

Code Review

This pull request introduces support for custom operators in the Data-Juicer planner by scanning provided paths via AST and injecting them into the LLM's retrieval candidates. It refactors the ownership of custom_operator_paths from the system specification to the process specification and enforces mutual exclusivity among dataset source CLI arguments. Feedback highlights security concerns regarding dynamic code loading, potential performance issues with recursive file scanning, and limitations in the AST-based metadata extraction for inherited classes and operator type inference.
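The inherited-class limitation mentioned above can be illustrated with a small AST scan. This is a sketch under assumptions, not the PR's scanner: the function name `find_operator_classes` and the base-class names `Mapper` and `Filter` are chosen for illustration only. Because the source is parsed rather than imported, no user code runs during the scan.

```python
import ast


def find_operator_classes(source: str, base_names: set[str]) -> list[dict]:
    """Collect classes whose *direct* bases match one of base_names."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Keep only the last dotted component, e.g. "ops.Mapper" -> "Mapper".
        bases = [ast.unparse(b).split(".")[-1] for b in node.bases]
        if base_names.intersection(bases):
            found.append(
                {
                    "name": node.name,
                    "bases": bases,
                    "doc": ast.get_docstring(node) or "",
                }
            )
    return found


SAMPLE = '''
class MyMapper(Mapper):
    """Lowercase text."""


class MyChild(MyMapper):
    pass
'''

ops = find_operator_classes(SAMPLE, {"Mapper", "Filter"})
```

Here `ops` contains `MyMapper` but not `MyChild`: a purely syntactic check sees only the direct base `MyMapper`, so a scanner of this kind would need an extra pass over the class hierarchy to pick up indirect subclasses.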

ShenQianli · 2026-04-10T10:25:20Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37e537da70

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ShenQianli · 2026-04-15T04:01:48Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e49ccbecc5

chatgpt-codex-connector · 2026-04-15T04:07:49Z

+
+        # Inject custom operators into retrieval candidates so the LLM
+        # planner can select them alongside built-in operators.
+        custom_candidates = scan_custom_operators(registered_op_names or None)


Scope custom candidate injection to requested paths

Calling scan_custom_operators(registered_op_names or None) means a plan with no custom_operator_paths (or an empty registration result) falls back to scanning all custom operators currently in the global registry. In a long-lived process, this leaks operators from previous sessions into unrelated plans, and the generator can pick one that later fails build_process_spec path checks, causing plan generation to fail nondeterministically after any prior custom-op usage.
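A guard along the following lines would avoid the fallback. It is a sketch only: `select_custom_candidates` and `fake_scan` are hypothetical names, and `fake_scan` merely imitates the behavior described above (scan everything when given `None`) with an assumed signature.

```python
from typing import Callable, Iterable, Optional


def select_custom_candidates(
    registered_op_names: Iterable[str],
    scan: Callable[[Optional[list]], list],
) -> list:
    names = list(registered_op_names)
    if not names:
        # Nothing was registered for this plan, so skip the scan entirely
        # instead of passing None and falling back to the global registry.
        return []
    return scan(names)


# Stand-in for the real scanner; the registry contents are made up.
_GLOBAL_REGISTRY = {
    "stale_op": {"name": "stale_op"},
    "fresh_op": {"name": "fresh_op"},
}


def fake_scan(names: Optional[list]) -> list:
    if names is None:
        return list(_GLOBAL_REGISTRY.values())
    return [_GLOBAL_REGISTRY[n] for n in names if n in _GLOBAL_REGISTRY]
```

With the guard, an empty `registered_op_names` yields no custom candidates, whereas `fake_scan([] or None)` would return every operator in the stand-in registry, including `stale_op`.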

chatgpt-codex-connector · 2026-04-15T04:07:49Z

+    # If warnings were raised and no operators were registered at all,
+    # the load effectively failed — report ok=False so callers don't
+    # silently proceed with an empty operator set.
+    success = bool(all_custom_names) or not load_warnings


Mark registration failed when new path loading warns

The success condition uses bool(all_custom_names) or not load_warnings, so if any custom operator was already registered earlier, loading errors for the current paths still return ok=True. This lets callers proceed as if registration succeeded even when requested paths failed to load (e.g., bad module/import errors), which can silently plan against stale operators and break execution later.
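The scenario can be reproduced in isolation. In the sketch below, `ok_original` copies the expression from the diff, while `ok_strict` is one possible stricter rule (fail on any warning); both function names are hypothetical, and whether a partially successful load should count as success is a design decision this thread does not settle.

```python
def ok_original(all_custom_names: list[str], load_warnings: list[str]) -> bool:
    # Expression from the diff: any registered name masks the warnings.
    return bool(all_custom_names) or not load_warnings


def ok_strict(load_warnings: list[str]) -> bool:
    # Hypothetical alternative: judge only the current load attempt.
    return not load_warnings


# An operator registered by an earlier call is still in the registry,
# and the current paths failed to import.
stale_names = ["stale_op"]
warnings_now = ["ImportError while loading the requested path"]

original_result = ok_original(stale_names, warnings_now)
strict_result = ok_strict(warnings_now)
```

In this scenario `original_result` is `True` even though nothing from the requested paths loaded, while `strict_result` is `False`.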

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e49ccbecc5

chatgpt-codex-connector · 2026-04-15T12:42:27Z

+    # If warnings were raised and no operators were registered at all,
+    # the load effectively failed — report ok=False so callers don't
+    # silently proceed with an empty operator set.
+    success = bool(all_custom_names) or not load_warnings


Return failure when custom-path loading emits warnings

The success condition bool(all_custom_names) or not load_warnings can report ok=True even when the current paths failed to load, as long as any custom operator was registered earlier in the same process. That makes callers treat a failed registration as successful and continue planning with stale operators, masking bad user input and producing plans that omit the intended custom operator.

chatgpt-codex-connector · 2026-04-15T12:42:28Z

+
+        # Inject custom operators into retrieval candidates so the LLM
+        # planner can select them alongside built-in operators.
+        custom_candidates = scan_custom_operators(registered_op_names or None)


Skip custom-operator scan when no paths were requested

Passing registered_op_names or None triggers registry-diff scanning when registered_op_names is empty, so plans created without custom_operator_paths can still include custom operators left in the global registry from earlier requests. In long-lived sessions this leaks stale custom operators into the candidate list, and selecting one can make build_process_spec fail because the operator is not from the current plan's paths.

feat: add custom operator scanning and improve CLI dataset handling

891f27f

gemini-code-assist Bot reviewed Apr 10, 2026

refactor: move custom_operator_paths ownership to ProcessSpec

539000f

cmgzn changed the title ~~Add custom operator scanning and improve CLI dataset handling~~ [WIP] Add custom operator scanning and improve CLI dataset handling Apr 10, 2026

cmgzn added 2 commits April 10, 2026 17:08

refactor: improve custom operator loading strategy and related docs/tests

56da7df

refactor: simplify custom operator handling and improve docs clarity

37e537d

cmgzn changed the title ~~[WIP] Add custom operator scanning and improve CLI dataset handling~~ Add custom operator scanning and improve CLI dataset handling Apr 10, 2026

chatgpt-codex-connector Bot reviewed Apr 10, 2026

Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated

Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated

Comment thread data_juicer_agents/commands/apply_cmd.py

cmgzn added 4 commits April 13, 2026 16:49

refactor: update custom operator scanning and retrieval logic

3cce951

docs: revert SKILL.md files to main branch state

a0b2ba4

refactor: update custom op scanner and plan validation logic

7a07390

refactor: improve custom operator scanning and registration process

e49ccbe

chatgpt-codex-connector Bot reviewed Apr 15, 2026

cmgzn closed this Apr 15, 2026

chatgpt-codex-connector Bot reviewed Apr 15, 2026

2 participants