GIE-460: Add mcpchecker evals for obs-mcp tools #34
Merged
21 commits
3308701 feat: Add mcpchecker evals for obs-mcp tools (slashpai)
2e558dd feat(evals): enable parallel execution and update for mcpchecker nightly (slashpai)
bcd8391 chore(evals): organize evals into separate subdirectories (slashpai)
5b59e75 feat(evals): switch default mcpchecker agent to openai:gpt-4o-mini (slashpai)
652e6a7 docs: add MCPChecker evals to TESTING.md and fix README accuracy (slashpai)
1e6415e chore: add .env to gitignore (slashpai)
2c416bd docs(evals): streamline mcpchecker README agent configuration section (slashpai)
ffd253f docs(evals): clarify agent vs judge LLM roles in mcpchecker README (slashpai)
bd2d421 chore: add Makefile targets to deploy additional kube-prometheus scra… (slashpai)
2a884b9 evals: strengthen mcpchecker assertions and add hard-difficulty tasks (slashpai)
d2db200 evals: relax query assertions to accept either instant or range queries (slashpai)
6105385 docs: add PROMPTS.md with example prompts for testing obs-mcp tools (slashpai)
9609d1f docs: move PROMPTS.md and METRICS_REFERENCE.md to docs/dev/ (slashpai)
09aa69f evals: update mcpchecker config for v0.0.14 (slashpai)
c83e3c8 evals: switch agent and judge model to gpt-5-nano (slashpai)
c79a357 chore: add makefile target to install mcpchecker (slashpai)
4eaf77f docs: consolidate mcpchecker eval docs and update for v0.0.15 (slashpai)
206362d docs: add CATEGORY filter for mcpchecker evals (slashpai)
b46ecca chore: Add 4 new mcpchecker tasks (slashpai)
1abac33 evals: add smoke test, time range task, and fix weak contains check (slashpai)
c1852c5 evals: reduce default runs to 1, improve assertions, and add visualiz… (slashpai)
@@ -0,0 +1,46 @@
# Metrics Reference

A quick reference mapping common questions to Prometheus metrics. Use this when `list_metrics` returns no relevant results: try the suggested regex patterns. Metrics vary by deployment (kube-prometheus, OpenShift, etc.), so not every metric listed here may exist in your cluster.

## list_metrics Regex Tips

Prometheus uses **full-string** regex matching: the pattern must match the entire metric name, so `kube_pod` does not match `kube_pod_container_status_terminated`. Use:

- **Prefix search:** `kube_pod_container_status.*` (matches any metric starting with that prefix)
- **Substring search:** `.*terminated.*` (matches any metric containing "terminated")

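The anchoring rule above can be illustrated with a small Python sketch. It is only an approximation: Prometheus uses RE2 syntax rather than Python's `re` module, and the helper name `matches_like_prometheus` is made up for this example. `re.fullmatch` stands in for full-string matching, while `re.search` shows the substring behavior that the tip warns against assuming.

```python
import re


def matches_like_prometheus(pattern: str, metric_name: str) -> bool:
    """Approximate full-string matching: the pattern must cover the whole name."""
    return re.fullmatch(pattern, metric_name) is not None


name = "kube_pod_container_status_terminated"

# A bare prefix does not cover the whole name, so it does not match.
print(matches_like_prometheus("kube_pod", name))
# Prefix search and substring search both cover the whole name.
print(matches_like_prometheus("kube_pod_container_status.*", name))
print(matches_like_prometheus(".*terminated.*", name))
# Unanchored search (NOT how the matching described above behaves) would match.
print(re.search("kube_pod", name) is not None)
```

The first call reports no match and the next two report a match, which mirrors the prefix and substring patterns in the list above.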
## Common Questions → Metrics

| Question | Suggested Metric(s) | list_metrics regex | Notes |
|----------|---------------------|--------------------|-------|
| OOMKilled containers | `kube_pod_container_status_last_terminated_reason` | `.*terminated_reason.*` | Check `reason="OOMKilled"` label. May not exist in all kube-state-metrics setups. |
| Pending pods | `kube_pod_status_phase` | `kube_pod_status_phase` | Filter `phase="Pending"` |
| Running pods | `kube_pod_status_phase` | `kube_pod_status_phase` | Filter `phase="Running"` |
| Crashlooping pods | `kube_pod_container_status_restarts_total` | `.*restarts.*` | Use range query with `increase()` |
| Pods created | `kube_pod_created` | `kube_pod_created` | Timestamp of pod creation |
| CPU usage (pods) | `container_cpu_usage_seconds_total` or `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate` | `.*cpu.*` | Raw metric or pre-aggregated recording rule |
| Memory usage (pods) | `container_memory_working_set_bytes` or `node_namespace_pod_container:container_memory_working_set_bytes` | `.*memory.*` | Raw metric or pre-aggregated recording rule |
| Network traffic | `node_network_receive_bytes_total`, `node_network_transmit_bytes_total` | `node_network.*` | |
| Prometheus head series | `prometheus_tsdb_head_series` | `prometheus_tsdb.*` | |
| Prometheus WAL size | `prometheus_tsdb_wal_storage_size_bytes` | `prometheus_tsdb.*` | |
| Prometheus request rate | `prometheus_http_requests_total` | `prometheus_http.*` | Use `rate()` |

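The Notes column hints at how each metric is meant to be queried (label filters, `rate()`, `increase()`). A minimal sketch of turning those hints into PromQL strings might look like the following; the helper names `selector` and `over_time` are hypothetical, and the `5m` and `1h` windows are assumptions chosen for illustration, not values prescribed by the table.

```python
def selector(metric: str, **labels: str) -> str:
    """Build a PromQL selector such as metric{label="value"}."""
    if not labels:
        return metric
    pairs = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{pairs}}}"


def over_time(func: str, metric: str, window: str) -> str:
    """Wrap a metric in a range function such as rate() or increase()."""
    return f"{func}({metric}[{window}])"


# Pending pods: filter on the phase label.
print(selector("kube_pod_status_phase", phase="Pending"))
# Prometheus request rate: use rate() (window is an assumption).
print(over_time("rate", "prometheus_http_requests_total", "5m"))
# Crashlooping pods: use increase() over a range (window is an assumption).
print(over_time("increase", "kube_pod_container_status_restarts_total", "1h"))
```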
## Query Efficiency

Agents should prefer aggregated PromQL over querying individual series. For example:

| Goal | Inefficient (N queries) | Efficient (1 query) |
|------|------------------------|---------------------|
| Top CPU pods | One `execute_range_query` per pod | `topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))` |
| Namespace resource usage | One query per namespace | `sum by (namespace) (container_memory_working_set_bytes)` |
| Pod restart rate | One query per pod | `topk(10, increase(kube_pod_container_status_restarts_total[1h]))` |

Use `topk()`, `bottomk()`, `sum by()`, `avg by()`, and `rate()` to answer questions in 1-3 queries instead of one per entity.

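To make the "one query instead of N" idea concrete, here is a hedged sketch that builds the "Top CPU pods" expression from the table and wraps it in a single request URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The function names and the `http://localhost:9090` address are assumptions for illustration; obs-mcp's own tools may issue queries differently.

```python
from urllib.parse import urlencode


def top_cpu_pods(k: int = 5, window: str = "5m") -> str:
    """One aggregated expression instead of one query per pod."""
    return (
        f"topk({k}, sum by (pod) "
        f"(rate(container_cpu_usage_seconds_total[{window}])))"
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed local Prometheus address; replace with your own.
url = instant_query_url("http://localhost:9090", top_cpu_pods())
print(top_cpu_pods())
print(url)
```

A single request like this returns the top pods directly, so the number of round trips does not grow with the number of pods.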
## When a Metric Doesn't Exist

If `list_metrics` with the suggested regex returns nothing:

1. The metric may not be scraped in your setup (e.g. `kube_pod_container_status_last_terminated_reason` requires specific kube-state-metrics config).
2. Try broader patterns: `kube.*`, `node.*`, `container.*`.
3. Inform the user that the metric is not available in their cluster.
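
The fallback steps above can be sketched as a small loop. This is an illustration only: the in-memory `available` list stands in for whatever `list_metrics` would return, the `discover` helper is hypothetical, and Python's `re.fullmatch` only approximates Prometheus's full-string matching.

```python
import re

# Broader patterns suggested in step 2.
BROAD_PATTERNS = ["kube.*", "node.*", "container.*"]


def discover(metric_names, suggested_regex, broad_patterns=BROAD_PATTERNS):
    """Try the suggested regex first, then each broader pattern in order.

    Returns (pattern_used, matches); pattern_used is None if nothing matched.
    """
    for pattern in [suggested_regex, *broad_patterns]:
        compiled = re.compile(pattern)
        matches = [name for name in metric_names if compiled.fullmatch(name)]
        if matches:
            return pattern, matches
    return None, []


# Hypothetical cluster that does not expose the terminated-reason metric.
available = ["kube_pod_status_phase", "node_network_receive_bytes_total"]
pattern, found = discover(available, ".*terminated_reason.*")

if pattern is None:
    print("The metric is not available in this cluster.")
else:
    print(f"Matched with {pattern!r}: {found}")
```

When even the broad patterns return nothing, the sketch falls through to step 3 and reports that the metric is unavailable.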
@@ -0,0 +1,51 @@
# Prompts You Can Try

This document lists example prompts you can use to test obs-mcp when connected to Cursor or another MCP client. These prompts align with the [MCPChecker evals](../evals/mcpchecker/) and exercise different obs-mcp tools.

For metric discovery tips (e.g. regex behavior, common question → metric mapping), see [METRICS_REFERENCE.md](./METRICS_REFERENCE.md).

## Metric Discovery

- List all available Prometheus metrics that contain 'kube' in the name.
- What node-related metrics are available in Prometheus?

## Label Exploration

- What labels are available for the kube_pod_info metric?
- What are the unique namespace values for the kube_pod_info metric?
- How many time series exist for the kube_pod_info metric? Show the cardinality.

## Queries

- Which pods are using the most CPU?
- Which pods are stuck in pending state?
- Which pods are receiving the most network traffic?
- How many head series does Prometheus have?
- What is the current storage size of the Prometheus WAL?
- How many requests per second are being made to Prometheus?
- How many pods were created in the last 5 minutes?
- Which pods were crashlooping in the last 5 minutes?

## Alerts

- Are there any currently firing alerts in the cluster?
- Are there any active silences in Alertmanager?
- Check if there are any firing alerts. If there are, investigate the related metrics for the most critical alert and summarize what's happening.

## Multi-Step Investigation

These prompts are part of the eval suite (hard difficulty) and test complex reasoning:

- Which namespace is consuming the most CPU and memory? Show me the top namespace for each.
- Is the cluster healthy? Give me an overview of any issues.

## Bonus: Additional Prompts

These prompts go beyond the eval suite and test more complex workflows:

- What's the memory usage of pods in the monitoring namespace?
- Show me the container restart count for all pods over the last hour.
- Which nodes have the highest CPU utilization?
- What's the disk usage on the cluster nodes?
- Are any containers in OOMKilled state?
- How many pods are running in the cluster?
Five files renamed without changes.
@@ -0,0 +1,2 @@
mcpchecker-*-out.json
*-error.txt
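
These two lines appear to be ignore patterns for eval output files (the file name is not shown in this view, so that is an assumption). As a rough illustration of which names they would cover, the sketch below uses Python's `fnmatchcase`, which behaves like gitignore matching for simple basename globs such as these; the example file names are invented.

```python
from fnmatch import fnmatchcase

# Patterns copied from the diff above.
IGNORE_PATTERNS = ["mcpchecker-*-out.json", "*-error.txt"]


def is_ignored(filename: str) -> bool:
    """Return True if the file name matches any of the ignore patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in IGNORE_PATTERNS)


# Hypothetical file names, for illustration only.
for name in ["mcpchecker-obs-out.json", "task-error.txt", "eval.yaml"]:
    print(name, is_ignored(name))
```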
Review comment (Member, Author) on `# Prompts You Can Try`: This doc is for development reference