KubeRCA Agent

AI-Powered Analysis Service for Kubernetes Incidents

Overview

The KubeRCA Agent is a Python-based analysis service that performs Root Cause Analysis (RCA) on Kubernetes incidents. It receives alert payloads from the Backend, collects relevant context from the Kubernetes cluster, and calls an LLM through Strands Agents (Gemini/OpenAI/Anthropic) to generate an analysis report. Prometheus, Loki, Tempo, and Istio-specific evidence are optional enrichers rather than hard requirements.

Key Features

AI-Powered RCA - Uses Strands Agents with Gemini/OpenAI/Anthropic for intelligent analysis
Portable Kubernetes Baseline - Collects pod logs, events, workload, Service, and Endpoints evidence without requiring mesh/APM stacks
Generic Manifest Read Tools - Reads namespaced core/CRD manifests via apiVersion + resource
Optional Observability Enrichers - Uses Prometheus, Loki, and Tempo when configured, while degrading gracefully when they are unavailable
Session Persistence - PostgreSQL-backed session history when SESSION_DB_* is configured
Fallback Mode - Returns basic summary when the provider API key is unavailable

Architecture

flowchart LR
  BE[Backend] -->|POST /analyze| AG[Agent]
  AG -->|Logs, Events| K8S[Kubernetes API]
  AG -->|PromQL Query| PR[Prometheus]
  AG -->|LLM Analysis| LLM[LLM Provider API]
  AG -.->|Session Storage| PG[(PostgreSQL)]
  AG -->|Analysis Result| BE

Analysis Flow

Receive alert payload from Backend (triggered by Alertmanager webhook or manual resolve)
Collect Kubernetes baseline context (logs, events, pod/workload status, Service, Endpoints)
Optionally query Prometheus/Loki/Tempo when those backends are configured
Build a capability-aware analysis prompt with collected context
Send to Strands Agents (Gemini/OpenAI/Anthropic) for RCA
Return structured analysis result
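
The collection steps above can be sketched as a baseline collector plus optional enrichers that degrade gracefully. This is a minimal illustration, not the Agent's actual code: `collect_context`, the `Collector` type, and the way enrichers are passed in are all hypothetical names chosen for the example. Only the `ok` / `unavailable` capability values are taken from the `/analyze` response shown later in this README.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical collector signature: takes the alert dict, returns evidence.
Collector = Callable[[dict], dict]


def collect_context(
    alert: dict,
    baseline: Collector,
    enrichers: Dict[str, Optional[Collector]],
) -> Tuple[dict, Dict[str, str]]:
    """Collect the Kubernetes baseline, then run each configured enricher.

    An enricher set to None is treated as "not configured". An enricher
    that raises is recorded as unavailable instead of failing the analysis.
    """
    context = {"k8s": baseline(alert)}
    capabilities = {"k8s_core": "ok"}

    for name, enricher in enrichers.items():
        if enricher is None:
            capabilities[name] = "unavailable"
            continue
        try:
            context[name] = enricher(alert)
            capabilities[name] = "ok"
        except Exception:
            # Optional backends must not break the baseline analysis.
            capabilities[name] = "unavailable"

    return context, capabilities
```

The returned capability map could then be used to build the capability-aware prompt, so the LLM is told which evidence sources were actually available.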

Note: Analysis is triggered both by Alertmanager webhook events and by manual alert resolve actions from the Frontend. Bulk resolve does not trigger Agent analysis.

Tech Stack

Category	Technology
Language	Python 3.10+
Framework	FastAPI
AI/LLM	Strands Agents (Gemini/OpenAI/Anthropic)
Package Manager	uv
Linting	ruff
Testing	pytest
Container	Docker
CI/CD	GitHub Actions

Quick Start

Prerequisites

Python 3.10+
uv (Python package manager)
(Optional) Kubernetes cluster access
(Optional) AI provider API key

Installation

# Run in repository root
# (monorepo layout: cd agent/main)
make install
# or manually:
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Run Development Server

make run
# or manually:
uvicorn app.main:app --host 0.0.0.0 --port 8000

The server starts at http://localhost:8000.

Run Tests

make test
# or:
pytest

Lint & Format

make lint    # Check code style
make format  # Auto-format code

API Endpoints

Method	Endpoint	Description
GET	`/`	Service info
GET	`/ping`	Health check
GET	`/healthz`	Kubernetes health probe
POST	`/analyze`	Analyze single alert
POST	`/summarize-incident`	Summarize resolved incident
GET	`/openapi.json`	OpenAPI specification

POST /analyze

Analyzes a single alert with Kubernetes baseline context and optional observability enrichers.

Request:

{
  "alert": {
    "status": "firing",
    "labels": {
      "alertname": "HighMemoryUsage",
      "severity": "critical",
      "namespace": "default",
      "pod": "example-pod"
    },
    "annotations": {
      "summary": "High memory usage detected",
      "description": "Pod memory usage > 90%"
    },
    "startsAt": "2024-01-01T00:00:00Z",
    "fingerprint": "abc123"
  },
  "thread_ts": "1234567890.123456"
}

Response:

{
  "status": "ok",
  "thread_ts": "1234567890.123456",
  "analysis": "## Root Cause Analysis\n...",
  "analysis_summary": "Brief summary of the issue",
  "analysis_detail": "Detailed RCA markdown content...",
  "analysis_quality": "medium",
  "missing_data": ["alert.labels.pod"],
  "warnings": ["namespace/pod_name missing from alert labels"],
  "capabilities": {
    "k8s_core": "ok",
    "manifest_read": "ok",
    "prometheus": "unavailable",
    "tempo": "unavailable",
    "mesh": "unknown",
    "traffic_policy": "unknown"
  },
  "context": {
    "namespace": "default",
    "pod_name": "example-pod",
    "analysis_quality": "medium"
  },
  "artifacts": []
}
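
A small Python client for this endpoint might look like the sketch below. It uses only the standard library. The helper names `build_analyze_request` and `degraded_capabilities` are illustrative and are not part of the Agent; the field names come from the request and response examples above.

```python
import json
import urllib.request


def build_analyze_request(
    alert: dict,
    thread_ts: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST /analyze request with the documented JSON body."""
    body = json.dumps({"alert": alert, "thread_ts": thread_ts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def degraded_capabilities(response: dict) -> list:
    """Return the names of capabilities whose state is anything but "ok"."""
    capabilities = response.get("capabilities", {})
    return sorted(name for name, state in capabilities.items() if state != "ok")


# Sending the request (needs a running Agent, so it is left commented out):
# request = build_analyze_request(alert, "1234567890.123456")
# with urllib.request.urlopen(request, timeout=60) as resp:
#     result = json.load(resp)
# print(result["analysis_quality"], degraded_capabilities(result))
```

Checking `capabilities`, `missing_data`, and `warnings` before trusting `analysis_detail` is a reasonable habit, since a `medium` or lower `analysis_quality` usually means some evidence was not collected.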

POST /summarize-incident

Summarizes a resolved incident with all associated alerts.

Generic Manifest Read Tools

The analysis engine can inspect namespaced Kubernetes manifests (core and CRD) with:

get_manifest(namespace, api_version, resource, name)
list_manifests(namespace, api_version, resource, label_selector=None, limit=20)

Examples:

get_manifest("bookinfo", "v1", "services", "reviews")
get_manifest("bookinfo", "networking.istio.io/v1", "virtualservices", "reviews-route")
list_manifests("bookinfo", "v1", "configmaps", "app=reviews", 10)

Notes:

api_version supports both core (v1) and grouped (group/version) formats.
resource must be a plural resource name (for example: pods, services, virtualservices).
For security and readability, get_manifest responses mask secret values and omit the status and metadata.managedFields fields.
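
To illustrate the notes above, the sketch below maps an api_version and plural resource onto the standard Kubernetes REST path and strips the fields that are described as masked or omitted. Whether the Agent builds paths or sanitizes manifests this way internally is an assumption: `split_api_version`, `manifest_path`, `sanitize_manifest`, and the `***` mask token are hypothetical and exist only for this example.

```python
import copy
from typing import Optional, Tuple


def split_api_version(api_version: str) -> Tuple[str, str]:
    """Split "group/version" into (group, version); core "v1" has no group."""
    if "/" in api_version:
        group, version = api_version.split("/", 1)
        return group, version
    return "", api_version


def manifest_path(
    namespace: str, api_version: str, resource: str, name: Optional[str] = None
) -> str:
    """Build the namespaced Kubernetes API path for a resource or collection."""
    group, version = split_api_version(api_version)
    # Core resources live under /api, grouped resources (including CRDs) under /apis.
    prefix = f"/apis/{group}/{version}" if group else f"/api/{version}"
    path = f"{prefix}/namespaces/{namespace}/{resource}"
    return f"{path}/{name}" if name else path


def sanitize_manifest(manifest: dict) -> dict:
    """Drop status and managedFields, and mask Secret values (assumed token)."""
    clean = copy.deepcopy(manifest)
    clean.pop("status", None)
    clean.get("metadata", {}).pop("managedFields", None)
    if clean.get("kind") == "Secret":
        for field in ("data", "stringData"):
            if field in clean:
                clean[field] = {key: "***" for key in (clean[field] or {})}
    return clean
```

The path split is why api_version has to carry the group for CRDs such as `networking.istio.io/v1`, and why resource must be the plural name the API server exposes.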

Configuration

Environment Variables

Variable	Description	Default
`AI_PROVIDER`	LLM provider (`gemini`, `openai`, `anthropic`)	`gemini`
`GEMINI_API_KEY`	Gemini API key for Strands Agents	-
`OPENAI_API_KEY`	OpenAI API key for Strands Agents	-
`ANTHROPIC_API_KEY`	Anthropic API key for Strands Agents	-
`GEMINI_MODEL_ID`	Gemini model ID	`gemini-3-flash-preview`
`OPENAI_MODEL_ID`	OpenAI model ID	`gpt-4o`
`ANTHROPIC_MODEL_ID`	Anthropic model ID	`claude-sonnet-4-20250514`
`ANTHROPIC_MAX_TOKENS`	Anthropic max output tokens	`4096`
`PROMETHEUS_URL`	Prometheus base URL	- (disabled)
`LOG_LEVEL`	Logging level	`info`
`WEB_CONCURRENCY`	Uvicorn worker count	`1`

Kubernetes Context

Variable	Description	Default
`K8S_API_TIMEOUT_SECONDS`	K8s API timeout	`5`
`K8S_EVENT_LIMIT`	Max events to fetch	`25`
`K8S_LOG_TAIL_LINES`	Log lines to fetch	`25`

Prometheus

Variable	Description	Default
`PROMETHEUS_URL`	Prometheus base URL	-
`PROMETHEUS_HTTP_TIMEOUT_SECONDS`	HTTP timeout	`5`
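
For reference, PROMETHEUS_URL is the base URL onto which the Prometheus HTTP API path is appended. The sketch below builds an instant query URL against `/api/v1/query`, which is the standard Prometheus endpoint. The specific PromQL expression and the helper names `pod_memory_query` and `instant_query_url` are assumptions made for this example; the queries the Agent actually issues are not documented here.

```python
from urllib.parse import urlencode


def pod_memory_query(namespace: str, pod: str) -> str:
    """Example PromQL selector for a pod's working-set memory (illustrative)."""
    return (
        "container_memory_working_set_bytes"
        f'{{namespace="{namespace}",pod="{pod}"}}'
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant query URL from a base URL and PromQL string."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"
```

A request to that URL would typically be made with a timeout equal to PROMETHEUS_HTTP_TIMEOUT_SECONDS, for example `urllib.request.urlopen(url, timeout=5)`.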

Loki (Historical Logs)

Variable	Description	Default
`LOKI_URL`	Loki base URL	-
`LOKI_HTTP_TIMEOUT_SECONDS`	Loki HTTP timeout	`10`
`LOKI_TENANT_ID`	Loki tenant header value (`X-Scope-OrgID`)	-
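
The tenant variable only matters for multi-tenant Loki deployments, where each request must carry the `X-Scope-OrgID` header. The sketch below shows one way to assemble a `query_range` request under that rule. `loki_query_range` is a hypothetical helper, and the `namespace` / `pod` stream labels in the usage note are assumptions that depend on how the cluster's log shipper labels streams.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode


def loki_query_range(
    base_url: str,
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 25,
    tenant_id: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for a Loki query_range call.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    headers: Dict[str, str] = {}
    if tenant_id:
        # Only sent when a tenant is configured (LOKI_TENANT_ID).
        headers["X-Scope-OrgID"] = tenant_id
    return url, headers
```

For example, `loki_query_range("http://loki:3100", '{namespace="default", pod="example-pod"}', start_ns, end_ns, tenant_id="team-a")` returns a URL and a header dict that can be passed to any HTTP client. The same header convention applies to TEMPO_TENANT_ID.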

Tempo (APM Traces)

Variable	Description	Default
`TEMPO_URL`	Tempo base URL (e.g. `http://tempo.monitoring.svc:3100`)	-
`TEMPO_HTTP_TIMEOUT_SECONDS`	Tempo HTTP timeout	`10`
`TEMPO_TENANT_ID`	Tempo tenant header value (`X-Scope-OrgID`)	-
`TEMPO_TRACE_LIMIT`	Max traces fetched per alert	`5`
`TEMPO_LOOKBACK_MINUTES`	Minutes before `startsAt` for trace search window	`15`
`TEMPO_FORWARD_MINUTES`	Minutes after `startsAt` for trace search window	`5`
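
The lookback and forward settings define a search window around the alert's `startsAt` timestamp. A minimal sketch of that arithmetic, assuming the window is expressed as Unix seconds and that `startsAt` is an RFC 3339 timestamp with at most microsecond precision (`trace_search_window` is a hypothetical helper, not the Agent's function):

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def trace_search_window(
    starts_at: str,
    lookback_minutes: int = 15,
    forward_minutes: int = 5,
) -> Tuple[int, int]:
    """Return (start, end) in Unix seconds around the alert start time."""
    # Python 3.10's fromisoformat does not accept a trailing "Z".
    started = datetime.fromisoformat(starts_at.replace("Z", "+00:00"))
    if started.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        started = started.replace(tzinfo=timezone.utc)
    start = started - timedelta(minutes=lookback_minutes)
    end = started + timedelta(minutes=forward_minutes)
    return int(start.timestamp()), int(end.timestamp())
```

With the defaults, an alert that started at `2024-01-01T00:00:00Z` yields a window from 23:45:00 on the previous day to 00:05:00, which is 20 minutes in total.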

Prompt Configuration

Variable	Description	Default
`PROMPT_TOKEN_BUDGET`	Approximate token budget	`32000`
`PROMPT_MAX_LOG_LINES`	Max log lines in prompt	`25`
`PROMPT_MAX_EVENTS`	Max events in prompt	`25`
`PROMPT_SUMMARY_MAX_ITEMS`	Max session summaries	`3`
`MASKING_REGEX_LIST_JSON`	JSON array of regex patterns for masking before LLM/DB response flows	`[]`
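
As an example of how a JSON array of regex patterns can be applied, the sketch below parses a `MASKING_REGEX_LIST_JSON`-style value and substitutes every match. The replacement token `[MASKED]` and the helper names are assumptions; the README does not state what the Agent substitutes for matched text.

```python
import json
import re
from typing import List, Pattern


def compile_masking_patterns(raw: str = "[]") -> List[Pattern[str]]:
    """Parse a JSON array of regex strings and compile each pattern."""
    patterns = json.loads(raw or "[]")
    if not isinstance(patterns, list):
        raise ValueError("masking configuration must be a JSON array")
    return [re.compile(str(pattern)) for pattern in patterns]


def mask_text(
    text: str, patterns: List[Pattern[str]], replacement: str = "[MASKED]"
) -> str:
    """Replace every match of every pattern with the replacement token."""
    for pattern in patterns:
        text = pattern.sub(replacement, text)
    return text
```

Remember that backslashes must be escaped twice when the value is written as JSON inside a shell or manifest, for example `MASKING_REGEX_LIST_JSON='["password=\\S+"]'`.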

LLM Retry

Variable	Description	Default
`LLM_RETRY_MAX_ATTEMPTS`	Max retry attempts for transient LLM API errors (5xx, 429)	`5`
`LLM_RETRY_MIN_WAIT`	Minimum backoff wait time in seconds	`1.0`
`LLM_RETRY_MAX_WAIT`	Maximum backoff wait time in seconds	`60.0`
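
The table names a backoff with minimum and maximum waits but does not state the exact schedule. The sketch below assumes plain exponential doubling clamped to the maximum, and assumes that the attempt limit counts total attempts rather than retries. `TransientLLMError`, `backoff_delays`, and `call_with_retry` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientLLMError(Exception):
    """Stand-in for a retryable provider error such as HTTP 429 or 5xx."""


def backoff_delays(
    max_attempts: int = 5, min_wait: float = 1.0, max_wait: float = 60.0
) -> List[float]:
    """Delays between attempts: min_wait doubled each time, capped at max_wait."""
    return [min(max_wait, min_wait * (2 ** i)) for i in range(max_attempts - 1)]


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 5,
    min_wait: float = 1.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientLLMError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, min_wait, max_wait)
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

With the defaults, the waits between the five attempts would be 1, 2, 4, and 8 seconds. Production retry code often adds random jitter to avoid synchronized retries; that is omitted here to keep the schedule easy to read.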

Session Storage (Required when LLM provider key is set)

Variable	Description
`SESSION_DB_HOST`	PostgreSQL host
`SESSION_DB_PORT`	PostgreSQL port
`SESSION_DB_NAME`	Database name
`SESSION_DB_USER`	Database user
`SESSION_DB_PASSWORD`	Database password

If any of GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY is set, configure all SESSION_DB_* variables as well.
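
A startup check for this rule could look like the following sketch. It only reports which variables are missing; how the Agent itself reacts to an incomplete configuration is not described here, and `missing_session_db_vars` is a hypothetical helper.

```python
import os
from typing import List, Mapping

PROVIDER_KEY_VARS = ("GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")
SESSION_DB_VARS = (
    "SESSION_DB_HOST",
    "SESSION_DB_PORT",
    "SESSION_DB_NAME",
    "SESSION_DB_USER",
    "SESSION_DB_PASSWORD",
)


def missing_session_db_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """List unset SESSION_DB_* variables when any provider key is configured.

    Returns an empty list when no provider key is set, because session
    storage is only required alongside an LLM provider key.
    """
    if not any(env.get(name) for name in PROVIDER_KEY_VARS):
        return []
    return [name for name in SESSION_DB_VARS if not env.get(name)]
```

Running this before the server starts, and printing the returned names, gives a clearer error than a database connection failure on the first analysis request.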

Project Structure

agent/
├── app/
│   ├── main.py                # FastAPI entrypoint
│   ├── api/
│   │   ├── analysis.py        # POST /analyze, POST /summarize-incident
│   │   └── health.py          # GET /, /ping, /healthz
│   ├── clients/
│   │   ├── k8s.py
│   │   ├── prometheus.py
│   │   ├── tempo.py
│   │   ├── session_repository.py
│   │   ├── summary_store.py
│   │   ├── strands_agent.py
│   │   ├── strands_patch.py
│   │   └── llm_providers/
│   ├── core/
│   │   ├── config.py
│   │   ├── dependencies.py
│   │   └── logging.py
│   ├── models/
│   ├── schemas/
│   │   ├── alert.py
│   │   └── analysis.py
│   └── services/analysis.py
├── docs/openapi.json
├── scripts/export_openapi.py
├── tests/
├── Dockerfile
├── Makefile
└── pyproject.toml

Development

Makefile Commands

Command	Description
`make install`	Install dependencies
`make run`	Run development server
`make lint`	Run ruff linter
`make format`	Format code with ruff
`make test`	Run pytest
`make build IMAGE=<tag>`	Build Docker image
`make curl-analyze`	Test analyze endpoint
`make curl-analyze-local`	Test with local server

Export OpenAPI Spec

When the API changes, regenerate the OpenAPI spec:

uv run python scripts/export_openapi.py

The spec is saved to docs/openapi.json.

Git Hooks (Optional)

Auto-regenerate OpenAPI on commit:

git config core.hooksPath .githooks

Contributing

Merge Policy (release-please)

release-please parses conventional commits; merge commits that include the PR title can be double-counted in the changelog.

Prefer Squash and merge or Rebase and merge.
If Create a merge commit is used, keep the PR title non-conventional (e.g., "Merge PR #123").
Use Conventional Commits for change commits that should appear in the changelog.

Testing

Unit Tests

make test
# or:
pytest tests/

Local Integration Test

Requires a Kubernetes cluster and provider API key:

AI_PROVIDER=gemini GEMINI_API_KEY=xxx KUBECONFIG=~/.kube/config make test-analysis-local

Manual API Test

curl -X POST http://localhost:8000/analyze \
  -H 'Content-Type: application/json' \
  -d '{
    "alert": {
      "status": "firing",
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning",
        "namespace": "default",
        "pod": "example-pod"
      },
      "annotations": {
        "summary": "Test summary",
        "description": "Test description"
      },
      "startsAt": "2024-01-01T00:00:00Z",
      "fingerprint": "test-fingerprint"
    },
    "thread_ts": "test-thread"
  }'

Docker

Build Image

docker build -t kube-rca-agent .
# or:
make build IMAGE=kube-rca-agent

Run Container

docker run -d -p 8000:8000 \
  -e GEMINI_API_KEY=your-api-key \
  kube-rca-agent

OOMKilled Test Scenario

Test the agent with a real OOMKilled scenario in Kubernetes.

Prerequisites

kubectl with cluster access
kube-rca namespace exists

Run Test

# Create OOM pod only
make test-oom-only

# Full test with analysis
GEMINI_API_KEY=xxx make test-analysis-local

# Cleanup
make cleanup-oom

Environment Variables

Variable	Description	Default
`KUBE_CONTEXT`	Kubernetes context	current
`LOCAL_OOM_NAMESPACE`	Test namespace	`kube-rca`
`LOCAL_OOM_DEPLOYMENT`	Deployment name	`oomkilled-test`
`LOCAL_OOM_MEMORY_LIMIT`	Memory limit	`64Mi`
`CLEANUP`	Auto-cleanup after test	`false`

Fallback Behavior

When the API key for the selected provider (for example, GEMINI_API_KEY) is not set, the agent returns a fallback summary:

{
  "status": "ok",
  "analysis": {
    "summary": "Alert received but AI analysis unavailable",
    "detail": "Basic alert information..."
  }
}
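
The decision between LLM analysis and the fallback can be pictured with the sketch below. The response shape and summary text mirror the JSON above; the detail string, the function names, and the `run_llm` callback are assumptions made for this example rather than the Agent's real interfaces.

```python
from typing import Callable, Optional


def fallback_response(alert: dict) -> dict:
    """Build a basic summary from alert fields when no LLM is available."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    # Hypothetical detail format: alert name, severity, and summary text.
    detail = (
        f"alertname={labels.get('alertname', 'unknown')} "
        f"severity={labels.get('severity', 'unknown')} "
        f"summary={annotations.get('summary', '')}"
    ).strip()
    return {
        "status": "ok",
        "analysis": {
            "summary": "Alert received but AI analysis unavailable",
            "detail": detail,
        },
    }


def analyze_or_fallback(
    alert: dict,
    api_key: Optional[str],
    run_llm: Callable[[dict], dict],
) -> dict:
    """Use the LLM when a provider key is present, otherwise fall back."""
    if not api_key:
        return fallback_response(alert)
    return run_llm(alert)
```

Because the fallback still returns `"status": "ok"`, callers that need to tell the two cases apart should inspect the summary or the analysis quality fields rather than the status alone.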

Related Components

KubeRCA Backend - Go REST API server
KubeRCA Frontend - React web dashboard
Helm Charts - Kubernetes deployment
Chaos Scenarios - Failure injection tests

License

This project is part of KubeRCA, licensed under the MIT License. See the LICENSE file for details.
