Skip to content

OCPEDGE-2727 Add eval harness config for cluster-diagnostic skill#80177

Closed
dhensel-rh wants to merge 3 commits into
openshift:mainfrom
dhensel-rh:OCPEDGE-2727
Closed

OCPEDGE-2727 Add eval harness config for cluster-diagnostic skill#80177
dhensel-rh wants to merge 3 commits into
openshift:mainfrom
dhensel-rh:OCPEDGE-2727

Conversation

@dhensel-rh

@dhensel-rh dhensel-rh commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add evaluation config and test cases for the two-node:cluster-diagnostic skill
  • Covers validate and recovery-guide modes (3 validate cases, 2 recovery-guide cases)
  • 5 judges: budget check, severity classification, procedure completeness, forbidden
    recommendations, knowledge base accuracy
  • Enables /test eval-cluster-diagnostic in CI once the release repo entry is added
    Test cases

Test cases

Five cases covering both testable modes:

validate (3 cases)

  • 001: Sequential shutdown with shutdown -h — expects BLOCKER
  • 002: Simultaneous Redfish GracefulShutdown — expects clean pass
  • 005: pcs node standby --all — expects BLOCKER

recovery-guide (2 cases)

  • 003: Full cluster shutdown procedure
  • 004: Standby recovery via SSH (no BMC involved)

Summary by CodeRabbit

This pull request updates the OpenShift Edge Tooling CI configuration (ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml) by adding two new optional evaluation jobs:

  • eval-cluster-diagnostic
  • eval-threat-model-tnf

Both jobs run the openshift-claude-agent-eval workflow with the same settings:

  • EVAL_MODEL=claude-opus-4-6
  • EVAL_PARALLELISM="3"
  • EVAL_CONFIG pointing to a per-skill YAML under plugins/two-node/evals/

eval-cluster-diagnostic is additionally configured with run_if_changed: ^plugins/two-node/(skills|evals) so it triggers when the two-node skill/eval inputs change.

For eval-cluster-diagnostic, the PR aims to add CI-exercised evaluation coverage for two modes—validate (cases 001/002/005, with a mix of shutdown command and expected failure/pass expectations) and recovery-guide (cases 003/004)—evaluated by five judges: budget check, severity classification, procedure completeness, forbidden recommendations, and knowledge base accuracy.

Note: the referenced evaluation config files (plugins/two-node/evals/cluster-diagnostic.yaml and plugins/two-node/evals/threat-model-tnf.yaml) are not present in the current repo checkout, consistent with the draft review feedback that generated files may still be missing to make the jobs runnable end-to-end.

@openshift-ci

openshift-ci Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2026
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds two optional CI test jobs (eval-cluster-diagnostic and eval-threat-model-tnf) to the openshift-eng-edge-tooling build configuration. Both jobs invoke the openshift-claude-agent-eval workflow and share the same model and parallelism settings, each pointing to its own evaluation configuration file.

Changes

Optional CI Evaluation Test Jobs

Layer / File(s) Summary
Eval test job definitions
ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml
Two optional test jobs are introduced: eval-cluster-diagnostic and eval-threat-model-tnf. Both are wired to the openshift-claude-agent-eval workflow with EVAL_MODEL set to claude-opus-4-6 and EVAL_PARALLELISM set to "3", each with its own EVAL_CONFIG path under plugins/two-node/evals/. The eval-cluster-diagnostic job includes a run_if_changed trigger for `plugins/two-node/(skills

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

jira/valid-reference

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding evaluation harness configuration for the cluster-diagnostic skill, which aligns with the core content of the pull request.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No Ginkgo test files were modified or added in this PR. The PR only adds CI/YAML test job configurations (eval-cluster-diagnostic, eval-threat-model-tnf), not Ginkgo tests.
Test Structure And Quality ✅ Passed PR contains no Ginkgo test code to review. The check is not applicable as the PR only modifies CI/operator YAML configuration files and documentation.
Microshift Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. It only modifies CI/operator configuration YAML files to add workflow job entries. The check is not applicable since there are no It(), Describe(), Contex...
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR adds only CI/YAML configuration and evaluation workflow entries, not Ginkgo e2e tests. The check applies specifically to new Go test code with It(), Describe(), Context(), When() patterns,...
Topology-Aware Scheduling Compatibility ✅ Passed File is CI/operator configuration only, not deployment manifests or operator code. No scheduling constraints, pod affinity rules, topology assumptions, or cluster topologies are introduced.
Ote Binary Stdout Contract ✅ Passed The PR only modifies CI/operator YAML configuration files and contains no OTE binary Go source code. The check requires flagging stdout writes in process-level Go code, which are not present in thi...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds CI workflow configuration for AI agent evaluation harness, not Ginkgo e2e tests. Custom check for Ginkgo IPv6/disconnected network compatibility is not applicable.
No-Weak-Crypto ✅ Passed PR adds only CI configuration files for OpenShift edge-tooling evaluation tests; no weak crypto patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-t...
Container-Privileges ✅ Passed No privileged container settings (privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation, runAsRoot) found in the CI/operator configuration file added in this PR.
No-Sensitive-Data-In-Logs ✅ Passed The PR adds two optional CI test entries to the openshift-eng-edge-tooling-main.yaml file. The new entries (eval-cluster-diagnostic and eval-threat-model-tnf) only set non-sensitive environment var...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 6, 2026
@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 6, 2026
@openshift-ci openshift-ci Bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2026
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ci-operator/projects/OCPEDGE-2505/findings.md (1)

69-81: ⚡ Quick win

Clarify interface naming: NodeController vs NodePowerController.

The document proposes two similar but distinct interfaces:

  • Lines 69-76: NodeController with 4 methods
  • Lines 144-160: NodePowerController with 4 methods plus FindNode()

The later interface (lines 144-160) adds the FindNode() method and uses more explicit type names (PowerState enum). It appears to be the refined version.

Consider removing the earlier interface definition (lines 69-81) and keeping only the detailed design in section 2, or explicitly note that lines 69-76 show the initial concept and lines 144-160 show the final design.

Also applies to: 143-160

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/projects/OCPEDGE-2505/findings.md` around lines 69 - 81, The file
defines two overlapping interfaces—NodeController (with
PowerOff/PowerOn/GetNodeState/WaitForNodeState) and a refined
NodePowerController (with the same methods plus FindNode and using PowerState);
remove the earlier NodeController block or mark it explicitly as
deprecated/initial concept so only the refined NodePowerController (including
FindNode and PowerState) remains as the canonical interface; update any
surrounding text to reference NodePowerController and its methods (PowerOff,
PowerOn, GetNodeState, WaitForNodeState, FindNode) to avoid
duplication/confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ci-operator/projects/OCPEDGE-2505/findings.md`:
- Around line 69-81: The file defines two overlapping interfaces—NodeController
(with PowerOff/PowerOn/GetNodeState/WaitForNodeState) and a refined
NodePowerController (with the same methods plus FindNode and using PowerState);
remove the earlier NodeController block or mark it explicitly as
deprecated/initial concept so only the refined NodePowerController (including
FindNode and PowerState) remains as the canonical interface; update any
surrounding text to reference NodePowerController and its methods (PowerOff,
PowerOn, GetNodeState, WaitForNodeState, FindNode) to avoid
duplication/confusion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1a778987-f31c-4a7d-b3af-656bc5ff49bb

📥 Commits

Reviewing files that changed from the base of the PR and between ccef783 and c3c9640.

📒 Files selected for processing (8)
  • ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml
  • ci-operator/projects/OCPEDGE-2505/.gitignore
  • ci-operator/projects/OCPEDGE-2505/CLAUDE.md
  • ci-operator/projects/OCPEDGE-2505/findings.md
  • ci-operator/projects/tnf-upgrade-ci-lane/.gitignore
  • ci-operator/projects/tnf-upgrade-ci-lane/CLAUDE.md
  • ci-operator/projects/tnf-upgrade-ci-lane/ci-runs.md
  • ci-operator/projects/tnf-upgrade-ci-lane/test-failures.md
✅ Files skipped from review due to trivial changes (6)
  • ci-operator/projects/tnf-upgrade-ci-lane/.gitignore
  • ci-operator/projects/OCPEDGE-2505/.gitignore
  • ci-operator/projects/OCPEDGE-2505/CLAUDE.md
  • ci-operator/projects/tnf-upgrade-ci-lane/test-failures.md
  • ci-operator/projects/tnf-upgrade-ci-lane/ci-runs.md
  • ci-operator/projects/tnf-upgrade-ci-lane/CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci-operator/config/openshift-eng/edge-tooling/openshift-eng-edge-tooling-main.yaml

from: root
run_if_changed: (SKILL\.md|^scripts/lint-skills\.py|^Makefile|^plugins/.*/skills/)
- always_run: false
as: eval-cluster-diagnostic

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we adopt the name of eval-<plugin> for the test?
There's also a question if we're going to run these evals separately or inside the same job.

@dhensel-rh dhensel-rh Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is set when you run this command

/eval-analyze --skill <skill> --config <eval config name>

I am ok with the naming convention, starting with eval-<plugin> since they will be easy to recognize in openshift-eng-edge-tooling-main.yaml

I think the evals should run as part of CI for new skills. for existing skills, we should probably go back and revisit what we choose to do.

On a related note, I worry about
openshift-eng-edge-tooling-main.yaml growing a little unwieldy because the number of skills we currently have and might have in the future.

clone: true
from: root
run_if_changed: (SKILL\.md|^scripts/lint-skills\.py|^Makefile|^plugins/.*/skills/)
- always_run: false

@kasturinarra kasturinarra Jun 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth considering adding run_if_changed: ^plugins/two-node/ here so these evals auto-trigger when the two-node skill code changes, rather than relying on someone remembering to type
/test eval-cluster-diagnostic manually.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this line to
run_if_changed: ^plugins/two-node/(skills|evals)
will properly match files under skills or evals.

The main thing to be mindful of is that each triggered eval uses Claude Opus and can run up to 2 hours, so you'll consume API credits on every matching PR. If that's acceptable for the team, auto-triggering is fine.

optional: true
steps:
env:
EVAL_CONFIG: plugins/two-node/evals/cluster-diagnostic.yaml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a separate test entry per eval, have you considered supporting a directory-based approach — something like EVAL_CONFIG_DIR: plugins/two-node/evals/(Not sure if this is feasible as of today, but asking) — where the commands.sh discovers and runs all eval configs in the directory? That way new evals are picked up automatically without needing a ci-operator config change each time. Would require a small update to the commands.sh loop logic from #79925.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible to add. As a proof of concept to get buy-in from the team, I think I'll stick to the manual approach for now. This could be future enhancement.

@kasturinarra

Copy link
Copy Markdown
Contributor

I think you are already aware of this, but the PR is missing generated files can you help add that please ? thanks !!

dhensel-rh and others added 2 commits June 16, 2026 14:40
…eat-model skills

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dhensel-rh dhensel-rh marked this pull request as ready for review June 16, 2026 18:43
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci openshift-ci Bot requested review from fonta-rh and jerpeter1 June 16, 2026 18:44
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhensel-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 16, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@dhensel-rh: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-eng-edge-tooling-main-eval-cluster-diagnostic openshift-eng/edge-tooling presubmit Presubmit changed
pull-ci-openshift-eng-edge-tooling-main-eval-threat-model-tnf openshift-eng/edge-tooling presubmit Presubmit changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@dhensel-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dhensel-rh dhensel-rh marked this pull request as draft June 26, 2026 18:39
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026
@dhensel-rh

Copy link
Copy Markdown
Contributor Author

Superseded by #81166 — fresh branch from upstream/main with all changes consolidated.

@dhensel-rh dhensel-rh closed this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants