WIP GIE-460: Add mcpchecker evals for obs-mcp tools by slashpai · Pull Request #34 · rhobs/obs-mcp

slashpai · 2026-02-25T08:10:57Z

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly

openshift-ci · 2026-02-25T08:11:01Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-02-25T08:11:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [slashpai]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

iNecas · 2026-03-09T10:57:12Z

evals/mcpchecker/eval.yaml

+  taskSets:
+    # Metric discovery
+    - path: tasks/metrics/list-metrics.yaml
+      assertions:


I know it's not the fault of this PR, but the split between the task definition and assertions is pretty terrible. Luckily the mcpchecker folks seem to be aware of it: mcpchecker/mcpchecker#168

openshift-ci-robot · 2026-03-11T02:20:56Z

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Mark all 16 task YAMLs with parallel: true for concurrent execution via mcpchecker's new --parallel flag. Update README to reflect CLI rename (eval -> check), multi-provider llm-agent replacing deprecated openai-agent, and parallel execution usage. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

openshift-ci-robot · 2026-03-11T02:54:16Z

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

- The agent selects the correct tool(s)
- Tool call count stays within bounds
- Response contains expected content (via LLM judge)
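The three per-task checks can be pictured with a small Python sketch. This is not mcpchecker's implementation or schema: `ToolCall`, `check_task`, and `substring_judge` are hypothetical names, and the LLM judge is stubbed with a plain substring test so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical shape, not mcpchecker's)."""
    name: str


def check_task(
    calls: Sequence[ToolCall],
    response: str,
    required_tools: set[str],
    min_calls: int,
    max_calls: int,
    judge: Callable[[str], bool],
) -> dict[str, bool]:
    """Return one pass/fail flag per check listed in the PR description."""
    used = {call.name for call in calls}
    return {
        # Every required tool was called at least once.
        "tools_selected": required_tools <= used,
        # Total number of tool calls is inside the allowed range.
        "call_count_in_bounds": min_calls <= len(calls) <= max_calls,
        # Stand-in for the LLM judge verdict on the final answer.
        "response_ok": judge(response),
    }


def substring_judge(expected: str) -> Callable[[str], bool]:
    """Assumed placeholder for an LLM judge: case-insensitive substring match."""
    return lambda response: expected.lower() in response.lower()


# Example: a "pending pods" task that should use both tools in at most 5 calls.
calls = [ToolCall("list_metrics"), ToolCall("execute_instant_query")]
result = check_task(
    calls,
    "There are 3 pending pods.",
    required_tools={"list_metrics", "execute_instant_query"},
    min_calls=1,
    max_calls=5,
    judge=substring_judge("pending pods"),
)
```

A real run would replace `substring_judge` with a call to the judge model and take the tool calls from the agent's recorded trace.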

Note: This is a smoke-test level evaluation covering basic tool discovery and usage. We need to add:

- Multi-step reasoning: tasks requiring 3+ chained tools (e.g., discover metric → query → analyze trend)
- Error handling: agent recovery from invalid queries or missing metrics
- Guardrail behavior: agent response when dangerous queries are blocked
- Parameter coverage: testing less-used params like silenced, inhibited, receiver, filter, time ranges
- Ambiguous prompts: vague diagnostic questions (e.g., "Why is my app slow?") requiring the agent to choose the right tools
- Hard difficulty tasks: complex multi-tool diagnostic scenarios

openshift-ci-robot · 2026-03-11T02:55:14Z

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Move lightspeed evals into evals/lightspeed/ to match the evals/mcpchecker/ structure. Clean up mcpchecker README formatting, replace deprecated openai-agent references with llm-agent, and add coverage summary with future improvement notes. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Update eval.yaml to use builtin.llm-agent with openai:gpt-4o-mini as the default agent, replacing builtin.claude-code. Align README with the new defaults: gpt-4o-mini for both agent and judge, link to TESTING.md for deployment setup and add .gitignore for generated output files. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Add MCPChecker Evals section to TESTING.md with instructions for Kind (in-cluster and local) and OpenShift setups. Fix missing OPENAI_API_KEY in quick-start snippets, add pods-created to coverage table, fix typo, and link agent configuration to upstream mcpchecker docs. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

openshift-ci-robot · 2026-03-11T04:45:27Z

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly

- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

Eval run showed agents legitimately use execute_range_query for prompts that were assumed to need execute_instant_query. Uses toolPattern to accept either, simplifies callOrder to list_metrics-first only, and increases maxToolCalls from 7 to 15. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
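The relaxed assertions described in that commit could look roughly like the following Python sketch. It only illustrates the idea and is not mcpchecker's API: the function names are hypothetical, the regex stands in for whatever `toolPattern` matches, "list_metrics-first" is read here as "list_metrics is called before any query tool", and 15 is the `maxToolCalls` value quoted in the commit message.

```python
import re
from typing import Sequence

# Accept either query tool, in the spirit of a tool-name pattern.
QUERY_TOOL_PATTERN = re.compile(r"execute_(instant|range)_query")
MAX_TOOL_CALLS = 15  # value quoted in the commit message


def used_query_tool(tool_names: Sequence[str]) -> bool:
    """True when at least one call matches the query-tool pattern."""
    return any(QUERY_TOOL_PATTERN.fullmatch(name) for name in tool_names)


def list_metrics_first(tool_names: Sequence[str]) -> bool:
    """True when list_metrics appears before the first query-tool call."""
    for name in tool_names:
        if name == "list_metrics":
            return True
        if QUERY_TOOL_PATTERN.fullmatch(name):
            return False
    return False  # list_metrics was never called


def passes(tool_names: Sequence[str]) -> bool:
    """Combine the call-count bound, the pattern match, and the ordering rule."""
    return (
        len(tool_names) <= MAX_TOOL_CALLS
        and used_query_tool(tool_names)
        and list_metrics_first(tool_names)
    )
```

With this shape, a trace such as `["list_metrics", "execute_range_query"]` is accepted just like one that uses `execute_instant_query`, while a trace that queries before listing metrics is rejected.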

openshift-ci bot added the do-not-merge/work-in-progress label Feb 25, 2026

openshift-ci bot added the approved label Feb 25, 2026

slashpai mentioned this pull request Mar 6, 2026

Use rhobs/obs-mcp toolset for Prometheus/Alertmanager (replaces observability mcp) openshift/openshift-mcp-server#124

Open

iNecas reviewed Mar 9, 2026

slashpai changed the title ~~WIP feat: Add mcpchecker evals for obs-mcp tools~~ WIP GIE-460: Add mcpchecker evals for obs-mcp tools Mar 11, 2026

slashpai added 2 commits March 11, 2026 07:51

feat: Add mcpchecker evals for obs-mcp tools

36ffaea

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

slashpai added 4 commits March 11, 2026 08:36

chore: add .env to gitignore

1fed864

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

slashpai force-pushed the mcp-evals branch from c2b4c03 to 1fed864 Compare March 11, 2026 04:44

slashpai added 7 commits March 11, 2026 10:19

docs(evals): streamline mcpchecker README agent configuration section

fa8297a

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

docs(evals): clarify agent vs judge LLM roles in mcpchecker README

a0bf128

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

chore: add Makefile targets to deploy additional kube-prometheus scra…

7085ac4

…pe targets Signed-off-by: Jayapriya Pai <janantha@redhat.com>

docs: add PROMPTS.md with example prompts for testing obs-mcp tools

6593e67

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

docs: move PROMPTS.md and METRICS_REFERENCE.md to docs/dev/

4d1169a

Developer-facing references for eval authoring and debugging. Also adds query efficiency tips to METRICS_REFERENCE.md and a single-task run example to TESTING.md. Signed-off-by: Jayapriya Pai <janantha@redhat.com>

slashpai mentioned this pull request Mar 13, 2026

GIE-501(prompt): add query efficiency guidance to reduce excessive tool calls #46

Open

slashpai wants to merge 13 commits into rhobs:main from slashpai:mcp-evals
