Skip to content

Add custom operator scanning and improve CLI dataset handling#37

Closed
cmgzn wants to merge 8 commits intodatajuicer:mainfrom
cmgzn:fix/djxplan
Closed

Add custom operator scanning and improve CLI dataset handling#37
cmgzn wants to merge 8 commits intodatajuicer:mainfrom
cmgzn:fix/djxplan

Conversation

@cmgzn
Copy link
Copy Markdown
Collaborator

@cmgzn cmgzn commented Apr 10, 2026

Introduce custom operator scanning capabilities and enhance the command-line interface for dataset handling. The changes streamline the integration of custom operators and enforce stricter validation for dataset sources. Additionally, the handling of custom operator paths has been centralized within the process specification.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for custom operators in the Data-Juicer planner by scanning provided paths via AST and injecting them into the LLM's retrieval candidates. It refactors the ownership of custom_operator_paths from the system specification to the process specification and enforces mutual exclusivity among dataset source CLI arguments. Feedback highlights security concerns regarding dynamic code loading, potential performance issues with recursive file scanning, and limitations in the AST-based metadata extraction for inherited classes and operator type inference.

Comment thread data_juicer_agents/tools/plan/_shared/process_spec.py Outdated
Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated
Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated
Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated
@cmgzn cmgzn changed the title Add custom operator scanning and improve CLI dataset handling [WIP] Add custom operator scanning and improve CLI dataset handling Apr 10, 2026
@cmgzn cmgzn changed the title [WIP] Add custom operator scanning and improve CLI dataset handling Add custom operator scanning and improve CLI dataset handling Apr 10, 2026
@ShenQianli
Copy link
Copy Markdown
Collaborator

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37e537da70

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated
Comment thread data_juicer_agents/capabilities/plan/custom_op_scanner.py Outdated
Comment thread data_juicer_agents/commands/apply_cmd.py
@ShenQianli
Copy link
Copy Markdown
Collaborator

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e49ccbecc5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


# Inject custom operators into retrieval candidates so the LLM
# planner can select them alongside built-in operators.
custom_candidates = scan_custom_operators(registered_op_names or None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope custom candidate injection to requested paths

Calling scan_custom_operators(registered_op_names or None) means a plan with no custom_operator_paths (or an empty registration result) falls back to scanning all custom operators currently in the global registry. In a long-lived process, this leaks operators from previous sessions into unrelated plans, and the generator can pick one that later fails build_process_spec path checks, causing plan generation to fail nondeterministically after any prior custom-op usage.

Useful? React with 👍 / 👎.

# If warnings were raised and no operators were registered at all,
# the load effectively failed — report ok=False so callers don't
# silently proceed with an empty operator set.
success = bool(all_custom_names) or not load_warnings
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark registration failed when new path loading warns

The success condition uses bool(all_custom_names) or not load_warnings, so if any custom operator was already registered earlier, loading errors for the current paths still return ok=True. This lets callers proceed as if registration succeeded even when requested paths failed to load (e.g., bad module/import errors), which can silently plan against stale operators and break execution later.

Useful? React with 👍 / 👎.

@cmgzn cmgzn closed this Apr 15, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e49ccbecc5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# If warnings were raised and no operators were registered at all,
# the load effectively failed — report ok=False so callers don't
# silently proceed with an empty operator set.
success = bool(all_custom_names) or not load_warnings
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Return failure when custom-path loading emits warnings

The success condition bool(all_custom_names) or not load_warnings can report ok=True even when the current paths failed to load, as long as any custom operator was registered earlier in the same process. That makes callers treat a failed registration as successful and continue planning with stale operators, masking bad user input and producing plans that omit the intended custom operator.

Useful? React with 👍 / 👎.


# Inject custom operators into retrieval candidates so the LLM
# planner can select them alongside built-in operators.
custom_candidates = scan_custom_operators(registered_op_names or None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip custom-operator scan when no paths were requested

Passing registered_op_names or None triggers registry-diff scanning when registered_op_names is empty, so plans created without custom_operator_paths can still include custom operators left in the global registry from earlier requests. In long-lived sessions this leaks stale custom operators into the candidate list, and selecting one can make build_process_spec fail because the operator is not from the current plan's paths.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants