WIP GIE-460: Add mcpchecker evals for obs-mcp tools #34

Draft
slashpai wants to merge 13 commits into rhobs:main from slashpai:mcp-evals

slashpai commented Feb 25, 2026

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly

openshift-ci bot commented Feb 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

taskSets:
  # Metric discovery
  - path: tasks/metrics/list-metrics.yaml
    assertions:

I know it's not the fault of this PR, but the split between the task definition and the assertions is pretty terrible. Luckily the mcpchecker folks seem to be aware of it: mcpchecker/mcpchecker#168

slashpai changed the title from "WIP feat: Add mcpchecker evals for obs-mcp tools" to "WIP GIE-460: Add mcpchecker evals for obs-mcp tools" on Mar 11, 2026

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test
that AI agents can discover and correctly use all 8 obs-mcp tools
against a live Prometheus/Alertmanager backend.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Mark all 16 task YAMLs with parallel: true for concurrent execution
via mcpchecker's new --parallel flag. Update README to reflect CLI
rename (eval -> check), multi-provider llm-agent replacing deprecated
openai-agent, and parallel execution usage.
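
To illustrate what marking tasks with parallel: true enables, here is a minimal Python sketch of a runner that executes flagged tasks concurrently and the remaining ones sequentially. It is not mcpchecker's implementation; `run_task`, `run_all`, the `workers` parameter, and the dictionary layout of a task are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(name: str) -> str:
    """Hypothetical placeholder for one eval task; a real run would drive an agent."""
    return f"{name}: done"


def run_all(tasks: list[dict], workers: int = 4) -> list[str]:
    """Run tasks flagged as parallel concurrently, then the rest one by one."""
    concurrent = [t["name"] for t in tasks if t.get("parallel")]
    serial = [t["name"] for t in tasks if not t.get("parallel")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, even if tasks finish out of order.
        results = list(pool.map(run_task, concurrent))
    results.extend(run_task(name) for name in serial)
    return results


tasks = [
    {"name": "list-metrics", "parallel": True},
    {"name": "get-silences"},
    {"name": "cpu-usage", "parallel": True},
]
print(run_all(tasks))
```

Tasks that share mutable state (for example, ones that create silences) would be the natural candidates to leave unflagged in a scheme like this.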

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

openshift-ci-robot commented Mar 11, 2026

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

  • The agent selects the correct tool(s)
  • Tool call count stays within bounds
  • Response contains expected content (via LLM judge)
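
The first two checks in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TaskAssertions` and `check_trace` are hypothetical names, not part of mcpchecker, and the recorded trace is assumed to be a plain list of tool names. The LLM-judge check is left out because it depends on an external model.

```python
from dataclasses import dataclass


@dataclass
class TaskAssertions:
    """Hypothetical, simplified stand-in for one task's assertions."""
    required_tools: set[str]
    min_tool_calls: int = 1
    max_tool_calls: int = 5


def check_trace(tool_calls: list[str], spec: TaskAssertions) -> list[str]:
    """Return human-readable failures; an empty list means the trace passes."""
    failures = []
    # Check 1: every required tool was called at least once.
    missing = spec.required_tools - set(tool_calls)
    if missing:
        failures.append(f"missing required tools: {sorted(missing)}")
    # Check 2: the number of tool calls stays within the configured bounds.
    if not spec.min_tool_calls <= len(tool_calls) <= spec.max_tool_calls:
        failures.append(
            f"tool call count {len(tool_calls)} outside "
            f"[{spec.min_tool_calls}, {spec.max_tool_calls}]"
        )
    return failures


spec = TaskAssertions(required_tools={"list_metrics"}, max_tool_calls=3)
print(check_trace(["list_metrics"], spec))
print(check_trace(["get_series"] * 4, spec))
```

Returning a list of failures rather than a single boolean makes it easier to see which assertion a given agent run violated.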

Note: This is a smoke-test level evaluation covering basic tool discovery and usage. We need to add:

  • Multi-step reasoning — tasks requiring 3+ chained tools (e.g., discover metric → query → analyze trend)
  • Error handling — agent recovery from invalid queries or missing metrics
  • Guardrail behavior — agent response when dangerous queries are blocked
  • Parameter coverage — testing less-used params like silenced, inhibited, receiver, filter, time ranges
  • Ambiguous prompts — vague diagnostic questions (e.g., "Why is my app slow?") requiring the agent to choose the right tools
  • Hard difficulty tasks — complex multi-tool diagnostic scenarios

Move lightspeed evals into evals/lightspeed/ to match the
evals/mcpchecker/ structure. Clean up mcpchecker README formatting,
replace deprecated openai-agent references with llm-agent, and add
coverage summary with future improvement notes.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Update eval.yaml to use builtin.llm-agent with openai:gpt-4o-mini as
the default agent, replacing builtin.claude-code. Align README with
the new defaults: gpt-4o-mini for both agent and judge, link to
TESTING.md for deployment setup and add .gitignore for generated
output files.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Add MCPChecker Evals section to TESTING.md with instructions for Kind
(in-cluster and local) and OpenShift setups. Fix missing OPENAI_API_KEY
in quick-start snippets, add pods-created to coverage table, fix typo,
and link agent configuration to upstream mcpchecker docs.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…pe targets

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks
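
As a rough illustration of the first two bullets, the sketch below checks that a set of tools appears in a given relative order within a recorded trace and looks up the run count by difficulty. The exact semantics of mcpchecker's callOrder assertion are not stated here, so treating it as a subsequence match is an assumption; `respects_order` and `RUNS_BY_DIFFICULTY` are hypothetical names.

```python
# Run counts taken from the commit message: 2 for easy, 3 for medium/hard.
RUNS_BY_DIFFICULTY = {"easy": 2, "medium": 3, "hard": 3}


def respects_order(tool_calls: list[str], ordered_tools: list[str]) -> bool:
    """True if ordered_tools occur in tool_calls in that relative order.

    Other calls may be interleaved (subsequence match, an assumed semantics).
    """
    remaining = iter(tool_calls)
    # `tool in remaining` advances the iterator, so each match must come
    # after the previous one.
    return all(tool in remaining for tool in ordered_tools)


trace = ["list_metrics", "get_label_names", "execute_instant_query"]
print(respects_order(trace, ["list_metrics", "execute_instant_query"]))
print(respects_order(trace[::-1], ["list_metrics", "execute_instant_query"]))
print(RUNS_BY_DIFFICULTY["hard"])
```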

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Eval run showed agents legitimately use execute_range_query for prompts
that were assumed to need execute_instant_query. Uses toolPattern to
accept either, simplifies callOrder to list_metrics-first only, and
increases maxToolCalls from 7 to 15.
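
The relaxed check described above can be sketched as follows. This is a hedged illustration rather than mcpchecker's matching logic: the regular expression, `MAX_TOOL_CALLS`, and `passes` are hypothetical, and interpreting toolPattern as a full-match regex on the tool name is an assumption.

```python
import re

# Accept either query tool instead of insisting on execute_instant_query.
QUERY_TOOL_PATTERN = re.compile(r"execute_(instant|range)_query")
MAX_TOOL_CALLS = 15  # raised from 7 in this commit


def passes(tool_calls: list[str]) -> bool:
    """Check list_metrics-first, the call budget, and that a query tool was used."""
    if not tool_calls or tool_calls[0] != "list_metrics":
        return False
    if len(tool_calls) > MAX_TOOL_CALLS:
        return False
    return any(QUERY_TOOL_PATTERN.fullmatch(name) for name in tool_calls)


print(passes(["list_metrics", "execute_range_query"]))
print(passes(["list_metrics", "execute_instant_query"]))
print(passes(["list_metrics", "get_series"]))
```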

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Developer-facing references for eval authoring and debugging. Also adds
query efficiency tips to METRICS_REFERENCE.md and a single-task run
example to TESTING.md.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
