WIP GIE-460: Add mcpchecker evals for obs-mcp tools #34
slashpai wants to merge 13 commits into rhobs:main
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: slashpai
taskSets:
  # Metric discovery
  - path: tasks/metrics/list-metrics.yaml
    assertions:
I know it's not the fault of this PR, but the split between the task definition and assertions is pretty terrible. Luckily, the mcpchecker folks seem to be aware of it: mcpchecker/mcpchecker#168
@slashpai: This pull request references GIE-460, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.
If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Mark all 16 task YAMLs with parallel: true for concurrent execution via mcpchecker's new --parallel flag. Update README to reflect CLI rename (eval -> check), multi-provider llm-agent replacing deprecated openai-agent, and parallel execution usage. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Move lightspeed evals into evals/lightspeed/ to match the evals/mcpchecker/ structure. Clean up mcpchecker README formatting, replace deprecated openai-agent references with llm-agent, and add coverage summary with future improvement notes. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Update eval.yaml to use builtin.llm-agent with openai:gpt-4o-mini as the default agent, replacing builtin.claude-code. Align README with the new defaults: gpt-4o-mini for both agent and judge, link to TESTING.md for deployment setup and add .gitignore for generated output files. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Add MCPChecker Evals section to TESTING.md with instructions for Kind (in-cluster and local) and OpenShift setups. Fix missing OPENAI_API_KEY in quick-start snippets, add pods-created to coverage table, fix typo, and link agent configuration to upstream mcpchecker docs. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…pe targets Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
An eval run showed that agents legitimately use execute_range_query for prompts that were assumed to need execute_instant_query. Use toolPattern to accept either tool, simplify callOrder to list_metrics-first only, and increase maxToolCalls from 7 to 15. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
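For reviewers who want a concrete picture of what the relaxed assertions accept, here is a minimal Python sketch of the three checks this commit describes: list_metrics must come first, either query tool is acceptable, and at most 15 tool calls are allowed. It is an illustration only, not mcpchecker's implementation or schema. The names `check_tool_calls` and `Verdict`, and the exact regex, are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    """Outcome of the illustrative checks (hypothetical type, not from mcpchecker)."""
    passed: bool
    reasons: List[str] = field(default_factory=list)


def check_tool_calls(
    calls: List[str],
    first_tool: str = "list_metrics",
    tool_pattern: str = r"execute_(instant|range)_query",  # assumed regex form
    max_tool_calls: int = 15,
) -> Verdict:
    """Apply the three checks to an ordered list of tool names an agent called."""
    reasons: List[str] = []

    # callOrder (simplified): the first call must be the discovery tool.
    if not calls or calls[0] != first_tool:
        reasons.append(f"first call must be {first_tool}")

    # toolPattern: at least one call must match, so either query tool is accepted.
    if not any(re.fullmatch(tool_pattern, name) for name in calls):
        reasons.append(f"no call matched pattern {tool_pattern}")

    # maxToolCalls: upper bound on the total number of calls.
    if len(calls) > max_tool_calls:
        reasons.append(f"{len(calls)} calls exceeds limit of {max_tool_calls}")

    return Verdict(passed=not reasons, reasons=reasons)
```

Under these assumptions, a trace such as `["list_metrics", "execute_range_query"]` passes, while a trace that starts with `execute_instant_query` fails the ordering check even though it matches the pattern.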
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Developer-facing references for eval authoring and debugging. Also adds query efficiency tips to METRICS_REFERENCE.md and a single-task run example to TESTING.md. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.
mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly