Problem after everything set up for autonomous-k8s-engineer #1

@markisa321

I tried to set up autonomous-k8s-engineer and followed all the provided steps, but whatever I do, I receive the error below. Can you suggest what the problem might be and where to look for a solution? :)

root@master:/home/kubernetes# kubectl logs -f -n kagent -l app.kubernetes.io/name=self-healing-agent
2026-01-30 13:42:01,665 - google_adk.google.adk.runners - WARNING - Event from an unknown agent: system, event id: 38dc1bd3-da14-4cd1-9dc7-74be4f6ee333
2026-01-30 13:42:01,665 - google_adk.google.adk.runners - WARNING - Event from an unknown agent: system, event id: 38dc1bd3-da14-4cd1-9dc7-74be4f6ee333
2026-01-30 13:42:01,672 - httpx - INFO - HTTP Request: POST http://kagent-tools.kagent:8084/mcp "HTTP/1.1 200 OK"
13:42:01 - LiteLLM:INFO: utils.py:3258 -
LiteLLM completion() model= llama3:latest; provider = ollama_chat
2026-01-30 13:42:01,688 - LiteLLM - INFO -
LiteLLM completion() model= llama3:latest; provider = ollama_chat
2026-01-30 13:42:01,697 - httpx - INFO - HTTP Request: POST http://kagent-tools.kagent:8084/mcp "HTTP/1.1 200 OK"
2026-01-30 13:42:01,983 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/show "HTTP/1.1 200 OK"
2026-01-30 13:42:02,251 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/show "HTTP/1.1 200 OK"
2026-01-30 13:42:09,843 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/chat "HTTP/1.1 200 OK"
2026-01-30 13:42:10,130 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/show "HTTP/1.1 200 OK"
2026-01-30 13:42:10,437 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/show "HTTP/1.1 200 OK"
2026-01-30 13:42:10,723 - httpx - INFO - HTTP Request: POST http://ollama.ollama.svc.cluster.local/api/show "HTTP/1.1 200 OK"
2026-01-30 13:42:10,736 - httpx - INFO - HTTP Request: POST http://kagent-tools.kagent:8084/mcp "HTTP/1.1 200 OK"
2026-01-30 13:42:10,756 - httpx - INFO - HTTP Request: POST http://kagent-controller.kagent:8083/api/sessions/b346f1e0-e3d1-43c4-b081-9e8d4184071b/events?user_id=A2A_USER_b346f1e0-e3d1-43c4-b081-9e8d4184071b "HTTP/1.1 201 Created"
2026-01-30 13:42:10,758 - kagent_adk.kagent.adk._agent_executor - ERROR - Error handling A2A request: Tool 'self_healing_agent' not found.
Available tools: k8s_apply_manifest, k8s_delete_resource, k8s_describe_resource, k8s_get_available_api_resources, k8s_get_events, k8s_get_pod_logs, k8s_get_resources, k8s_patch_resource, k8s_scale

Possible causes:

  1. LLM hallucinated the function name - review agent instruction clarity
  2. Tool not registered - verify agent.tools list
  3. Name mismatch - check for typos

Suggested fixes:

  • Review agent instruction to ensure tool usage is clear
  • Verify tool is included in agent.tools list
  • Check for typos in function name
    Traceback (most recent call last):
    File "/.kagent/packages/kagent-adk/src/kagent/adk/_agent_executor.py", line 146, in execute
    await self._handle_request(context, event_queue, runner, run_args)
    File "/.kagent/packages/kagent-adk/src/kagent/adk/_agent_executor.py", line 241, in _handle_request
    async for adk_event in agen:
    ...<7 lines>...
    await event_queue.enqueue_event(a2a_event)
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/runners.py", line 505, in run_async
    async for event in agen:
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/runners.py", line 493, in _run_with_trace
    async for event in agen:
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/runners.py", line 722, in _exec_with_plugin
    async for event in agen:
    ...<54 lines>...
    yield (modified_event if modified_event else event)
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/runners.py", line 482, in execute
    async for event in agen:
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/agents/base_agent.py", line 294, in run_async
    async for event in agen:
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/agents/llm_agent.py", line 460, in _run_async_impl
    async for event in agen:
    ...<5 lines>...
    should_pause = True
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/base_llm_flow.py", line 370, in run_async
    async for event in agen:
    last_event = event
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/base_llm_flow.py", line 457, in _run_one_step_async
    async for event in agen:
    ...<3 lines>...
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/base_llm_flow.py", line 569, in _postprocess_async
    async for event in agen:
    yield event
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/base_llm_flow.py", line 681, in _postprocess_handle_function_calls_async
    if function_response_event := await functions.handle_function_calls_async(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    invocation_context, function_call_event, llm_request.tools_dict
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/functions.py", line 198, in handle_function_calls_async
    return await handle_function_call_list_async(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
    )
    ^
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/functions.py", line 244, in handle_function_call_list_async
    function_response_events = await asyncio.gather(*tasks)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/functions.py", line 338, in _execute_single_function_call_async
    raise tool_error
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/functions.py", line 324, in _execute_single_function_call_async
    tool = _get_tool(function_call, tools_dict)
    File "/.kagent/.venv/lib/python3.13/site-packages/google/adk/flows/llm_flows/functions.py", line 729, in _get_tool
    raise ValueError(error_msg)
    ValueError: Tool 'self_healing_agent' not found.
    Available tools: k8s_apply_manifest, k8s_delete_resource, k8s_describe_resource, k8s_get_available_api_resources, k8s_get_events, k8s_get_pod_logs, k8s_get_resources, k8s_patch_resource, k8s_scale

Possible causes:

  1. LLM hallucinated the function name - review agent instruction clarity
  2. Tool not registered - verify agent.tools list
  3. Name mismatch - check for typos

Suggested fixes:
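To make the failure easier to read, the final ValueError can be picked apart with a few lines of Python. This is only a sketch: the helper name `parse_tool_error` is made up for illustration, and the message text is copied from the log above.

```python
import re

# Message text copied from the ValueError in the log above.
ERROR_MSG = (
    "Tool 'self_healing_agent' not found.\n"
    "Available tools: k8s_apply_manifest, k8s_delete_resource, "
    "k8s_describe_resource, k8s_get_available_api_resources, k8s_get_events, "
    "k8s_get_pod_logs, k8s_get_resources, k8s_patch_resource, k8s_scale"
)


def parse_tool_error(message):
    """Hypothetical helper: return (requested_tool, available_tools)."""
    requested = re.search(r"Tool '([^']+)' not found", message)
    available = re.search(r"Available tools: (.+)", message)
    if requested is None or available is None:
        raise ValueError("message does not look like a tool-not-found error")
    # Split the comma-separated list into clean tool names.
    names = [name.strip() for name in available.group(1).split(",")]
    return requested.group(1), names


requested, available = parse_tool_error(ERROR_MSG)
print("model asked for:", requested)
print("number of registered tools:", len(available))
print("requested tool is registered:", requested in available)
```

The requested name looks like the agent's own name (the pod label is `self-healing-agent`) with underscores instead of hyphens. That may point toward cause 1 in the list, the model trying to call itself as a tool, rather than a registration problem, but I have not been able to confirm this.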

Agent instructions:

modelConfig: default-model-config
systemMessage: |
  You are a Kubernetes Self-Healing Agent responsible for maintaining cluster health.

  IMPORTANT:
  - The words DETECT, DIAGNOSE, PLAN, EXECUTE, VERIFY are NOT tool/function names.
  - You MUST ONLY call tools from this list exactly as written:
    k8s_get_resources, k8s_get_pod_logs, k8s_get_events, k8s_describe_resource,
    k8s_scale, k8s_patch_resource, k8s_apply_manifest, k8s_delete_resource,
    k8s_get_available_api_resources
  - Never call tools named "detect" or "perform_health_check".

  ## Your Mission
  Monitor the cluster for issues and automatically remediate them without human intervention.

  ## Your Capabilities
  You have access to the following tools:
  - Kubernetes tools: Get pods, logs, events, apply/delete resources
  - Prometheus tools: Query metrics, check alerts, analyze trends

  ## Your Process
  When investigating an issue:
  1. DETECT: Check for firing alerts or anomalous metrics (conceptual step, NOT a tool)
  2. DIAGNOSE: Gather logs, events, and metrics to identify root cause (use k8s_get_* tools)
  3. PLAN: Determine the remediation action
  4. EXECUTE: Apply the fix using available Kubernetes tools
  5. VERIFY: Confirm the issue is resolved using k8s_get_resources and events

  ## Common Remediation Strategies

  ### CrashLoopBackOff
  - Use k8s_get_pod_logs and k8s_get_events
  - If caused by OOM: Increase memory limits using k8s_patch_resource

  ### Pod Not Ready
  - Check pod status and events
  - Verify service endpoints

  ### Scale to Zero
  - If a deployment has replicas=0 and should be running, use k8s_scale to restore it to 3

  ### Resource Exhaustion
  - Identify affected pods
  - Scale horizontally using k8s_scale

  ## Safety Rules
  - Never delete namespaces: kube-system, kagent, monitoring
  - Always verify changes after applying
  - Prefer scaling or patching over deleting resources
  - Log every action you take
  - When scaling, be precise and explicit
tools:
- mcpServer:
    kind: RemoteMCPServer
    name: kagent-tool-server
    toolNames:
    - k8s_get_resources
    - k8s_get_pod_logs
    - k8s_get_events
    - k8s_apply_manifest
    - k8s_delete_resource
    - k8s_patch_resource
    - k8s_describe_resource
    - k8s_get_available_api_resources
    - k8s_scale
    - prometheus_query
    - prometheus_get_alerts
  type: McpServer

description: An AI agent that monitors cluster health and automatically remediates
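One more check that might help narrow this down: comparing the `toolNames` in the config above with the "Available tools" line from the error. The snippet below is a rough sketch that only uses names copied from this issue. The variable names are my own and it does not talk to kagent or the MCP server.

```python
# Tool names copied from the agent's toolNames list above.
configured = {
    "k8s_get_resources",
    "k8s_get_pod_logs",
    "k8s_get_events",
    "k8s_apply_manifest",
    "k8s_delete_resource",
    "k8s_patch_resource",
    "k8s_describe_resource",
    "k8s_get_available_api_resources",
    "k8s_scale",
    "prometheus_query",
    "prometheus_get_alerts",
}

# Tool names copied from "Available tools" in the error message.
available = {
    "k8s_apply_manifest",
    "k8s_delete_resource",
    "k8s_describe_resource",
    "k8s_get_available_api_resources",
    "k8s_get_events",
    "k8s_get_pod_logs",
    "k8s_get_resources",
    "k8s_patch_resource",
    "k8s_scale",
}

# The name the model actually tried to call.
requested = "self_healing_agent"

# Tools listed in the config that the runtime did not report as available.
missing_from_runtime = sorted(configured - available)

print("configured but not available:", missing_from_runtime)
print("requested tool is configured:", requested in configured)
```

Two things stand out from this comparison: `prometheus_query` and `prometheus_get_alerts` are configured but do not appear among the available tools, and `self_healing_agent` is in neither list. The system message also mentions "Prometheus tools" under capabilities while the IMPORTANT section restricts calls to the `k8s_*` names only, so the instructions may be sending the model mixed signals.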
