Conversation
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saswatamcode

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing
- YOU MUST CALL THIS TOOL BEFORE ANY OTHER QUERY TOOL
- This tool MUST be called first for EVERY observability question to:
  1. Discover what metrics actually exist in this environment
  2. Find the EXACT metric name to use in queries
  3. Avoid querying non-existent metrics
  4. The 'name_regex' parameter should always be provided, and should be a best guess of what the metric would be named.
  5. Do not use a blanket regex like `.*` or `.+` in the 'name_regex' parameter. Use specific ones like `kube.*`, `node.*`, etc.
- REGEX PATTERN GUIDANCE:
  - Prometheus metrics are typically prefixed (e.g., 'prometheus_tsdb_head_series', 'kube_pod_status_phase')
  - To match metrics CONTAINING a substring, use wildcards: '.*tsdb.*' matches 'prometheus_tsdb_head_series'
  - Without wildcards, the pattern matches EXACTLY: 'tsdb' only matches a metric literally named 'tsdb' (which rarely exists)
  - Common patterns: 'kube_pod.*' (pods), '.*memory.*' (memory-related), 'node_.*' (node metrics)
  - If you get empty results, try adding '.*' before/after your search term
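The anchored matching behaviour the guidance describes can be illustrated with a small sketch. This is a hypothetical stand-in for the tool, assuming the server anchors `name_regex` the way Prometheus's `=~` matcher does (the metric list and the `list_metrics` helper below are illustrative, not the PR's actual code):

```python
import re

METRICS = [
    "prometheus_tsdb_head_series",
    "kube_pod_status_phase",
    "node_memory_MemAvailable_bytes",
]

def list_metrics(name_regex: str) -> list[str]:
    # Prometheus anchors name regexes, so emulate that with fullmatch:
    # the pattern must cover the ENTIRE metric name, not a substring.
    pat = re.compile(name_regex)
    return [m for m in METRICS if pat.fullmatch(m)]

print(list_metrics("tsdb"))        # [] -- nothing is literally named "tsdb"
print(list_metrics(".*tsdb.*"))    # ['prometheus_tsdb_head_series']
print(list_metrics("kube_pod.*"))  # ['kube_pod_status_phase']
```

This is why a bare substring like `tsdb` returns empty results while `.*tsdb.*` does not.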
How common do you think it would be to want a prefix match vs. an inner-word match? We could also direct the LLM to use `^` to match a prefix, and do a "contains" search by default. Exact search would then be achieved with `^tsdb$`. It seems that would be hard to get wrong by mistake.
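The contains-by-default rule suggested here can be sketched in a few lines. This is only an illustration of the proposal, not the server's implementation; `matches` is a hypothetical helper:

```python
import re

def matches(name_regex: str, metric: str) -> bool:
    # re.search already gives "contains" semantics for un-anchored
    # patterns; explicit ^ and $ then opt back into prefix or exact
    # matching, as the review comment suggests.
    return re.search(name_regex, metric) is not None

assert matches("tsdb", "prometheus_tsdb_head_series")        # contains
assert matches("^kube_pod", "kube_pod_status_phase")         # prefix
assert not matches("^tsdb$", "prometheus_tsdb_head_series")  # exact only
```

Under this rule the failure mode flips: a bare substring finds too much rather than nothing, which is arguably easier for the model to recover from.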
Here is one scenario I tried on a kind cluster: *Are any containers in OOMKilled state?* There was no metric available for this in the cluster.
Tool Call History: "Are any containers in OOMKilled state?"
First Attempt
| # | Tool | Parameters | Result |
|---|---|---|---|
| 1 | list_metrics | name_regex: "oom\|OOM\|terminated" | [] |
| 2 | list_metrics | name_regex: "kube_pod_container_status" | [] |
| 3 | list_metrics | name_regex: "container_status\|container_state" | [] |
| 4 | list_metrics | name_regex: "kube_pod_container" | [] |
| 5 | list_metrics | name_regex: ".*reason.*" | kube_pod_container_status_waiting_reason, kube_pod_init_container_status_terminated_reason, kube_pod_status_reason |
| 6 | list_metrics | name_regex: ".*terminated.*" | kube_pod_container_status_terminated, kube_pod_init_container_status_terminated, kube_pod_init_container_status_terminated_reason, ... |
| 7 | list_metrics | name_regex: "kube_pod_container_status_last_terminated_reason" | [] |
| 8 | list_metrics | name_regex: "kube_pod_container_status_terminated" | kube_pod_container_status_terminated |
| 9 | get_label_names | metric: "kube_pod_container_status_terminated" | Error: Aborted |
| 10 | list_metrics | name_regex: "kube_pod_container_status" | [] |
| 11 | list_metrics | name_regex: ".*terminated_reason" | kube_pod_init_container_status_terminated_reason |
| 12 | get_label_names | metric: "kube_pod_init_container_status_terminated_reason" | __name__, container, instance, job, namespace, pod, reason, uid |
| 13 | get_label_names | metric: "kube_pod_container_status_terminated" | __name__, container, instance, job, namespace, pod, uid |
| 14 | list_metrics | name_regex: "last_terminated" | [] |
| 15 | get_label_values | metric: "kube_pod_init_container_status_terminated_reason", label: "reason" | ["Completed"] |
| 16 | execute_instant_query | query: kube_pod_init_container_status_terminated_reason{reason="OOMKilled"} | [] |
| 17 | execute_instant_query | query: kube_pod_container_status_terminated{namespace=~".+"} == 1 | [] |
| 18 | list_metrics | name_regex: "container.*memory\|memory.*container" | User rejected |
| 19 | list_metrics | name_regex: "node_vmstat_oom" | Error: Aborted |
Summary
- Total: 19 tool calls
- Outcome: No OOMKilled containers found
- Note: `kube_pod_container_status_last_terminated_reason` is not present in this cluster. Only `kube_pod_init_container_status_terminated_reason` exists (init containers), with `reason` values `["Completed"]`.
So for a scenario where the metric corresponding to a question doesn't exist, we may need to alert the user earlier rather than letting them keep trying and getting frustrated? Say:
**FAIL FAST:** If list_metrics returns no relevant metric after 2–3 attempts, STOP and inform the user immediately. Do not keep trying more patterns. Say something like: "The metric for [question] does not appear to exist in your cluster—it may not be scraped by your monitoring setup." This avoids frustrating the user with many failed tool calls.
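The fail-fast rule above could also be enforced on the client side rather than only in the prompt. A minimal sketch, assuming a hypothetical wrapper around the tool call (`list_metrics_tool` and the attempt budget are illustrative assumptions, not the PR's code):

```python
def query_with_fail_fast(patterns, list_metrics_tool, max_empty=3):
    """Try the model's ranked name_regex guesses; stop early on misses.

    `patterns` is the list of candidate regexes the LLM would try;
    `list_metrics_tool` stands in for the MCP list_metrics call.
    """
    for attempt, pattern in enumerate(patterns, start=1):
        result = list_metrics_tool(pattern)
        if result:
            return result  # found candidate metrics, proceed to query
        if attempt >= max_empty:
            break  # give up early instead of burning more tool calls
    return ("The metric for this question does not appear to exist in "
            "your cluster; it may not be scraped by your monitoring setup.")
```

Compared with the prompt-only version, this guarantees the cap is honored even when the model ignores the instruction.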
This can vary based on the model used, but maybe we should still have a description for fail-fast as well?
Another way could be to have a new parameter `attempt_number` that the server uses to return a "metric not found" hint after N attempts.
wdyt?
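The `attempt_number` idea above might look something like this on the server side. Everything here is hypothetical: the handler name, response shape, and threshold are assumptions for illustration only:

```python
def handle_list_metrics(name_regex, attempt_number=1, *, search=None, hint_after=3):
    # Hypothetical server-side handler: once the model reports it has
    # made `hint_after` attempts with no hits, attach an explicit hint
    # so it stops retrying and informs the user instead.
    search = search or (lambda regex: [])  # stand-in for the real metric search
    results = search(name_regex)
    response = {"metrics": results}
    if not results and attempt_number >= hint_after:
        response["hint"] = (
            f"No metric matched after {attempt_number} attempts; the metric "
            "you are looking for is probably not scraped in this cluster. "
            "Tell the user instead of trying more patterns."
        )
    return response
```

One caveat with this design: it relies on the model passing an honest `attempt_number`, so the server may additionally want to track attempts per session itself.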
LLMs seem to get confused about what regex to pass to this tool. Guidance in the prompt helps improve this a bit.
This improves it a bit


