prompt: Improve list_metrics prompt #44

Open
saswatamcode wants to merge 2 commits into rhobs:main from saswatamcode:promptlistm

Conversation

@saswatamcode
Member

LLMs seem to get confused about which regex to pass to this tool. Adding guidance to the prompt improves this a bit:
Screenshot 2026-03-11 at 10 02 31
Screenshot 2026-03-11 at 10 01 44
Screenshot 2026-03-11 at 10 02 07

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
@saswatamcode saswatamcode requested a review from a team March 11, 2026 04:34
@openshift-ci openshift-ci bot requested review from iNecas and slashpai March 11, 2026 04:34
@openshift-ci

openshift-ci bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saswatamcode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

- YOU MUST CALL THIS TOOL BEFORE ANY OTHER QUERY TOOL
- This tool MUST be called first for EVERY observability question to:
  1. Discover what metrics actually exist in this environment
  2. Find the EXACT metric name to use in queries
  3. Avoid querying non-existent metrics
  4. The 'name_regex' parameter should always be provided, and be a best guess of what the metric would be named like.
  5. Do not use a blanket regex like .* or .+ in the 'name_regex' parameter. Use specific ones like kube.*, node.*, etc.
- REGEX PATTERN GUIDANCE:
  - Prometheus metrics are typically prefixed (e.g., 'prometheus_tsdb_head_series', 'kube_pod_status_phase')
  - To match metrics CONTAINING a substring, use wildcards: '.*tsdb.*' matches 'prometheus_tsdb_head_series'
  - Without wildcards, the pattern matches EXACTLY: 'tsdb' only matches a metric literally named 'tsdb' (which rarely exists)
  - Common patterns: 'kube_pod.*' (pods), '.*memory.*' (memory-related), 'node_.*' (node metrics)
  - If you get empty results, try adding '.*' before/after your search term
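The rules in this prompt text can be sketched in a few lines. This is only an illustration, assuming `name_regex` is applied with full anchoring as the guidance describes (modelled here with Python's `re.fullmatch`; Prometheus itself uses RE2 syntax). The helpers `list_metrics` and `widen` are hypothetical stand-ins, not the tool's actual implementation.

```python
import re

# Example metric names quoted in the prompt text.
METRICS = ["prometheus_tsdb_head_series", "kube_pod_status_phase"]

# Patterns the prompt asks the model not to send.
BLANKET = {".*", ".+"}


def list_metrics(name_regex, metrics=METRICS):
    """Return the names that the pattern matches in full (anchored at both ends)."""
    if name_regex in BLANKET:
        raise ValueError("blanket pattern; use a specific one such as 'kube.*'")
    pattern = re.compile(name_regex)
    return [m for m in metrics if pattern.fullmatch(m)]


def widen(term):
    """Wrap a search term in wildcards, the fallback suggested for empty results."""
    return f".*{term}.*"


print(list_metrics("tsdb"))         # bare term: must equal the whole metric name
print(list_metrics(widen("tsdb")))  # widened term: substring match
print(list_metrics("kube_pod.*"))   # prefix match
```

Under these assumptions the bare term returns nothing, while the widened and prefix patterns each return the one matching name.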
Contributor


How often do you think matching a prefix would be more correct than matching inside the name? We could also make "contains" search the default and direct the LLM to use ^ to match a prefix. Exact search would then be achieved with ^tsdb$, which seems hard to produce by mistake.
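A minimal sketch of that alternative, assuming the server switched to unanchored matching (modelled here with Python's `re.search`). The function `search_metrics` and the metric names `tsdb` and `tsdb_compactions_total` are made up for illustration.

```python
import re

# Only the first name comes from the prompt text; the other two are hypothetical.
METRICS = ["prometheus_tsdb_head_series", "tsdb", "tsdb_compactions_total"]


def search_metrics(name_regex, metrics=METRICS):
    """Return the names in which the pattern matches anywhere (unanchored)."""
    pattern = re.compile(name_regex)
    return [m for m in metrics if pattern.search(m)]


print(search_metrics("tsdb"))    # contains: all three names
print(search_metrics("^tsdb"))   # prefix: the two names starting with "tsdb"
print(search_metrics("^tsdb$"))  # exact: only the name "tsdb"
```

With this default a bare search term behaves the way the model seems to expect, and the stricter forms require an explicit anchor.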

Member

@slashpai slashpai Mar 11, 2026


Here is one scenario I tried on a kind cluster: "Are any containers in OOMKilled state?" There was no metric available for this in the cluster.

Tool Call History: "Are any containers in OOMKilled state?"

First Attempt

| # | Tool | Parameters | Result |
|---|------|------------|--------|
| 1 | `list_metrics` | name_regex: `"oom\|OOM\|terminated"` | [] |
| 2 | `list_metrics` | name_regex: `"kube_pod_container_status"` | [] |
| 3 | `list_metrics` | name_regex: `"container_status\|container_state"` | [] |
| 4 | `list_metrics` | name_regex: `"kube_pod_container"` | [] |
| 5 | `list_metrics` | name_regex: `".*reason.*"` | `kube_pod_container_status_waiting_reason`, `kube_pod_init_container_status_terminated_reason`, `kube_pod_status_reason` |
| 6 | `list_metrics` | name_regex: `".*terminated.*"` | `kube_pod_container_status_terminated`, `kube_pod_init_container_status_terminated`, `kube_pod_init_container_status_terminated_reason`, ... |
| 7 | `list_metrics` | name_regex: `"kube_pod_container_status_last_terminated_reason"` | [] |
| 8 | `list_metrics` | name_regex: `"kube_pod_container_status_terminated"` | `kube_pod_container_status_terminated` |
| 9 | `get_label_names` | metric: `"kube_pod_container_status_terminated"` | Error: Aborted |
| 10 | `list_metrics` | name_regex: `"kube_pod_container_status"` | [] |
| 11 | `list_metrics` | name_regex: `".*terminated_reason"` | `kube_pod_init_container_status_terminated_reason` |
| 12 | `get_label_names` | metric: `"kube_pod_init_container_status_terminated_reason"` | `__name__`, `container`, `instance`, `job`, `namespace`, `pod`, `reason`, `uid` |
| 13 | `get_label_names` | metric: `"kube_pod_container_status_terminated"` | `__name__`, `container`, `instance`, `job`, `namespace`, `pod`, `uid` |
| 14 | `list_metrics` | name_regex: `"last_terminated"` | [] |
| 15 | `get_label_values` | metric: `"kube_pod_init_container_status_terminated_reason"`, label: `"reason"` | `["Completed"]` |
| 16 | `execute_instant_query` | query: `kube_pod_init_container_status_terminated_reason{reason="OOMKilled"}` | [] |
| 17 | `execute_instant_query` | query: `kube_pod_container_status_terminated{namespace=~".+"} == 1` | [] |
| 18 | `list_metrics` | name_regex: `"container.*memory\|memory.*container"` | User rejected |
| 19 | `list_metrics` | name_regex: `"node_vmstat_oom"` | Error: Aborted |

Summary

  • Total: 19 tool calls
  • Outcome: No OOMKilled containers found
  • Note: kube_pod_container_status_last_terminated_reason is not present in this cluster. Only kube_pod_init_container_status_terminated_reason exists (init containers), with reason values ["Completed"].
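Several of the empty results in this trace are consistent with full anchoring rather than with missing metrics. The sketch below replays a few of the patterns against the metric names that appear in the Result column, assuming full-match semantics (modelled with Python's `re.fullmatch`). The list is only a partial view of the cluster, since row 6 is truncated, and `full_match` is a hypothetical helper.

```python
import re

# Metric names seen in the Result column of the trace (partial view of the cluster).
OBSERVED = [
    "kube_pod_container_status_terminated",
    "kube_pod_container_status_waiting_reason",
    "kube_pod_init_container_status_terminated",
    "kube_pod_init_container_status_terminated_reason",
    "kube_pod_status_reason",
]


def full_match(name_regex):
    """Return the observed names that the pattern matches in full."""
    pattern = re.compile(name_regex)
    return [m for m in OBSERVED if pattern.fullmatch(m)]


# Patterns from attempts 1, 2 and 11, plus a widened form of attempt 2.
for regex in [
    "oom|OOM|terminated",
    "kube_pod_container_status",
    "kube_pod_container_status.*",
    ".*terminated_reason",
]:
    print(regex, "->", full_match(regex))
```

Under this model, attempts 2 and 10 come back empty only because the pattern lacks a trailing `.*`; the widened form finds both `kube_pod_container_status_*` metrics in one call.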

So for a scenario where no metric corresponding to the question exists, we may need to alert the user earlier, rather than letting the attempts pile up and the user get frustrated?

Say:

**FAIL FAST:** If list_metrics returns no relevant metric after 2–3 attempts, STOP and inform the user immediately. Do not keep trying more patterns. Say something like: "The metric for [question] does not appear to exist in your cluster—it may not be scraped by your monitoring setup." This avoids frustrating the user with many failed tool calls.

Member


This can vary with the model used, but maybe we should still have a description for fail fast as well?

Member


Another way could be to have a new parameter, attempt_number, that the server uses to return a “metric not found” hint after N attempts.

wdyt?
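A rough sketch of how that could look on the server side. Everything here is an assumption for discussion: the `attempt_number` parameter does not exist yet, and `ListMetricsResult`, `list_metrics_with_hint`, and the threshold of 3 (taken from the "2–3 attempts" suggestion above) are illustrative names and values only.

```python
from dataclasses import dataclass, field

# Assumed threshold, based on the "2-3 attempts" suggestion in this thread.
MAX_ATTEMPTS = 3


@dataclass
class ListMetricsResult:
    metrics: list = field(default_factory=list)
    hint: str = ""  # empty unless the server wants the model to stop retrying


def list_metrics_with_hint(matches, attempt_number, max_attempts=MAX_ATTEMPTS):
    """Attach a 'metric not found' hint once empty results reach the attempt limit."""
    if not matches and attempt_number >= max_attempts:
        return ListMetricsResult(
            hint=(
                "No matching metric after "
                f"{attempt_number} attempts. It may not be scraped in this "
                "cluster. Stop retrying and inform the user."
            )
        )
    return ListMetricsResult(metrics=list(matches))


print(list_metrics_with_hint([], attempt_number=1))
print(list_metrics_with_hint([], attempt_number=3))
```

One trade-off is that the model has to report `attempt_number` honestly, so the prompt-based fail-fast text may still be useful alongside it.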

3 participants