Add custom operator scanning and improve CLI dataset handling#37
Add custom operator scanning and improve CLI dataset handling#37cmgzn wants to merge 8 commits intodatajuicer:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for custom operators in the Data-Juicer planner by scanning provided paths via AST and injecting them into the LLM's retrieval candidates. It refactors the ownership of custom_operator_paths from the system specification to the process specification and enforces mutual exclusivity among dataset source CLI arguments. Feedback highlights security concerns regarding dynamic code loading, potential performance issues with recursive file scanning, and limitations in the AST-based metadata extraction for inherited classes and operator type inference.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37e537da70
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e49ccbecc5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| # Inject custom operators into retrieval candidates so the LLM | ||
| # planner can select them alongside built-in operators. | ||
| custom_candidates = scan_custom_operators(registered_op_names or None) |
There was a problem hiding this comment.
Scope custom candidate injection to requested paths
Calling scan_custom_operators(registered_op_names or None) means a plan with no custom_operator_paths (or an empty registration result) falls back to scanning all custom operators currently in the global registry. In a long-lived process, this leaks operators from previous sessions into unrelated plans, and the generator can pick one that later fails build_process_spec path checks, causing plan generation to fail nondeterministically after any prior custom-op usage.
Useful? React with 👍 / 👎.
| # If warnings were raised and no operators were registered at all, | ||
| # the load effectively failed — report ok=False so callers don't | ||
| # silently proceed with an empty operator set. | ||
| success = bool(all_custom_names) or not load_warnings |
There was a problem hiding this comment.
Mark registration failed when new path loading warns
The success condition uses bool(all_custom_names) or not load_warnings, so if any custom operator was already registered earlier, loading errors for the current paths still return ok=True. This lets callers proceed as if registration succeeded even when requested paths failed to load (e.g., bad module/import errors), which can silently plan against stale operators and break execution later.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e49ccbecc5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # If warnings were raised and no operators were registered at all, | ||
| # the load effectively failed — report ok=False so callers don't | ||
| # silently proceed with an empty operator set. | ||
| success = bool(all_custom_names) or not load_warnings |
There was a problem hiding this comment.
Return failure when custom-path loading emits warnings
The success condition bool(all_custom_names) or not load_warnings can report ok=True even when the current paths failed to load, as long as any custom operator was registered earlier in the same process. That makes callers treat a failed registration as successful and continue planning with stale operators, masking bad user input and producing plans that omit the intended custom operator.
Useful? React with 👍 / 👎.
|
|
||
| # Inject custom operators into retrieval candidates so the LLM | ||
| # planner can select them alongside built-in operators. | ||
| custom_candidates = scan_custom_operators(registered_op_names or None) |
There was a problem hiding this comment.
Skip custom-operator scan when no paths were requested
Passing registered_op_names or None triggers registry-diff scanning when registered_op_names is empty, so plans created without custom_operator_paths can still include custom operators left in the global registry from earlier requests. In long-lived sessions this leaks stale custom operators into the candidate list, and selecting one can make build_process_spec fail because the operator is not from the current plan's paths.
Useful? React with 👍 / 👎.
Introduce custom operator scanning capabilities and enhance the command-line interface for dataset handling. The changes streamline the integration of custom operators and enforce stricter validation for dataset sources. Additionally, the handling of custom operator paths has been centralized within the process specification.