OCPEDGE-2727 Add eval harness config for cluster-diagnostic skill#80177
OCPEDGE-2727 Add eval harness config for cluster-diagnostic skill#80177dhensel-rh wants to merge 3 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds two optional CI test jobs ( ChangesOptional CI Evaluation Test Jobs
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Suggested labels
🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
ci-operator/projects/OCPEDGE-2505/findings.md (1)
69-81: ⚡ Quick winClarify interface naming:
NodeControllervsNodePowerController.The document proposes two similar but distinct interfaces:
- Lines 69-76:
NodeControllerwith 4 methods- Lines 144-160:
NodePowerControllerwith 4 methods plusFindNode()The later interface (lines 144-160) adds the
FindNode()method and uses more explicit type names (PowerStateenum). It appears to be the refined version.Consider removing the earlier interface definition (lines 69-81) and keeping only the detailed design in section 2, or explicitly note that lines 69-76 show the initial concept and lines 144-160 show the final design.
Also applies to: 143-160
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/projects/OCPEDGE-2505/findings.md` around lines 69 - 81, The file defines two overlapping interfaces—NodeController (with PowerOff/PowerOn/GetNodeState/WaitForNodeState) and a refined NodePowerController (with the same methods plus FindNode and using PowerState); remove the earlier NodeController block or mark it explicitly as deprecated/initial concept so only the refined NodePowerController (including FindNode and PowerState) remains as the canonical interface; update any surrounding text to reference NodePowerController and its methods (PowerOff, PowerOn, GetNodeState, WaitForNodeState, FindNode) to avoid duplication/confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ci-operator/projects/OCPEDGE-2505/findings.md`:
- Around line 69-81: The file defines two overlapping interfaces—NodeController
(with PowerOff/PowerOn/GetNodeState/WaitForNodeState) and a refined
NodePowerController (with the same methods plus FindNode and using PowerState);
remove the earlier NodeController block or mark it explicitly as
deprecated/initial concept so only the refined NodePowerController (including
FindNode and PowerState) remains as the canonical interface; update any
surrounding text to reference NodePowerController and its methods (PowerOff,
PowerOn, GetNodeState, WaitForNodeState, FindNode) to avoid
duplication/confusion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1a778987-f31c-4a7d-b3af-656bc5ff49bb
📒 Files selected for processing (8)
ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yamlci-operator/projects/OCPEDGE-2505/.gitignoreci-operator/projects/OCPEDGE-2505/CLAUDE.mdci-operator/projects/OCPEDGE-2505/findings.mdci-operator/projects/tnf-upgrade-ci-lane/.gitignoreci-operator/projects/tnf-upgrade-ci-lane/CLAUDE.mdci-operator/projects/tnf-upgrade-ci-lane/ci-runs.mdci-operator/projects/tnf-upgrade-ci-lane/test-failures.md
✅ Files skipped from review due to trivial changes (6)
- ci-operator/projects/tnf-upgrade-ci-lane/.gitignore
- ci-operator/projects/OCPEDGE-2505/.gitignore
- ci-operator/projects/OCPEDGE-2505/CLAUDE.md
- ci-operator/projects/tnf-upgrade-ci-lane/test-failures.md
- ci-operator/projects/tnf-upgrade-ci-lane/ci-runs.md
- ci-operator/projects/tnf-upgrade-ci-lane/CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (1)
- ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml
| from: root | ||
| run_if_changed: (SKILL\.md|^scripts/lint-skills\.py|^Makefile|^plugins/.*/skills/) | ||
| - always_run: false | ||
| as: eval-cluster-diagnostic |
There was a problem hiding this comment.
Should we adopt the name of eval-<plugin> for the test?
There's also a question if we're going to run these evals separately or inside the same job.
There was a problem hiding this comment.
The name is set when you run this command
/eval-analyze --skill <skill> --config <eval config name>
I am ok with the naming convention, starting with eval-<plugin> since they will be easy to recognize in openshift-eng-edge-tooling-main.yaml
I think the evals should run as part of CI for new skills. for existing skills, we should probably go back and revisit what we choose to do.
On a related note, I worry about
openshift-eng-edge-tooling-main.yaml growing a little unwieldy because the number of skills we currently have and might have in the future.
| clone: true | ||
| from: root | ||
| run_if_changed: (SKILL\.md|^scripts/lint-skills\.py|^Makefile|^plugins/.*/skills/) | ||
| - always_run: false |
There was a problem hiding this comment.
Worth considering adding run_if_changed: ^plugins/two-node/ here so these evals auto-trigger when the two-node skill code changes, rather than relying on someone remembering to type
/test eval-cluster-diagnostic manually.
There was a problem hiding this comment.
Adding this line to
run_if_changed: ^plugins/two-node/(skills|evals)
will properly match files under skills or evals.
The main thing to be mindful of is that each triggered eval uses Claude Opus and can run up to 2 hours, so you'll consume API credits on every matching PR. If that's acceptable for the team, auto-triggering is fine.
| optional: true | ||
| steps: | ||
| env: | ||
| EVAL_CONFIG: plugins/two-node/evals/cluster-diagnostic.yaml |
There was a problem hiding this comment.
Instead of a separate test entry per eval, have you considered supporting a directory-based approach — something like EVAL_CONFIG_DIR: plugins/two-node/evals/(Not sure if this is feasible as of today, but asking) — where the commands.sh discovers and runs all eval configs in the directory? That way new evals are picked up automatically without needing a ci-operator config change each time. Would require a small update to the commands.sh loop logic from #79925.
There was a problem hiding this comment.
This is possible to add. As a proof of concept to get buy-in from the team, I think I'll stick to the manual approach for now. This could be future enhancement.
|
I think you are already aware of this, but the PR is missing generated files can you help add that please ? thanks !! |
…eat-model skills Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dhensel-rh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
@dhensel-rh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Superseded by #81166 — fresh branch from upstream/main with all changes consolidated. |
Summary
recommendations, knowledge base accuracy
Test cases
Test cases
Five cases covering both testable modes:
validate (3 cases)
shutdown -h— expects BLOCKERpcs node standby --all— expects BLOCKERrecovery-guide (2 cases)
Summary by CodeRabbit
This pull request updates the OpenShift Edge Tooling CI configuration (
ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml) by adding two new optional evaluation jobs:eval-cluster-diagnosticeval-threat-model-tnfBoth jobs run the
openshift-claude-agent-evalworkflow with the same settings:EVAL_MODEL=claude-opus-4-6EVAL_PARALLELISM="3"EVAL_CONFIGpointing to a per-skill YAML underplugins/two-node/evals/eval-cluster-diagnosticis additionally configured withrun_if_changed: ^plugins/two-node/(skills|evals)so it triggers when the two-node skill/eval inputs change.For
eval-cluster-diagnostic, the PR aims to add CI-exercised evaluation coverage for two modes—validate(cases 001/002/005, with a mix of shutdown command and expected failure/pass expectations) andrecovery-guide(cases 003/004)—evaluated by five judges: budget check, severity classification, procedure completeness, forbidden recommendations, and knowledge base accuracy.Note: the referenced evaluation config files (
plugins/two-node/evals/cluster-diagnostic.yamlandplugins/two-node/evals/threat-model-tnf.yaml) are not present in the current repo checkout, consistent with the draft review feedback that generated files may still be missing to make the jobs runnable end-to-end.