AI-Powered Analysis Service for Kubernetes Incidents
The KubeRCA Agent is a Python-based analysis service that performs Root Cause Analysis (RCA) on Kubernetes incidents. It receives alert payloads from the Backend, collects relevant context from the Kubernetes cluster, and uses an LLM via Strands Agents (Gemini/OpenAI/Anthropic) to generate analysis reports. Prometheus, Loki, Tempo, and Istio-specific evidence are optional enrichers rather than hard requirements.
- AI-Powered RCA - Uses Strands Agents with Gemini/OpenAI/Anthropic for intelligent analysis
- Portable Kubernetes Baseline - Collects pod logs, events, workload, Service, and Endpoints evidence without requiring mesh/APM stacks
- Generic Manifest Read Tools - Reads namespaced core/CRD manifests via `apiVersion+resource`
- Optional Observability Enrichers - Uses Prometheus, Loki, and Tempo when configured, while degrading gracefully when they are unavailable
- Session Persistence - PostgreSQL-backed session history when `SESSION_DB_*` is configured
- Fallback Mode - Returns basic summary when the provider API key is unavailable
flowchart LR
BE[Backend] -->|POST /analyze| AG[Agent]
AG -->|Logs, Events| K8S[Kubernetes API]
AG -->|PromQL Query| PR[Prometheus]
AG -->|LLM Analysis| LLM[LLM Provider API]
AG -.->|Session Storage| PG[(PostgreSQL)]
AG -->|Analysis Result| BE
- Receive alert payload from Backend (triggered by Alertmanager webhook or manual resolve)
- Collect Kubernetes baseline context (logs, events, pod/workload status, Service, Endpoints)
- Optionally query Prometheus/Loki/Tempo when those backends are configured
- Build a capability-aware analysis prompt with collected context
- Send to Strands Agents (Gemini/OpenAI/Anthropic) for RCA
- Return structured analysis result
Note: Analysis is triggered both by Alertmanager webhook events and by manual alert resolve actions from the Frontend. Bulk resolve does not trigger Agent analysis.
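The collection and prompt-building steps above can be sketched roughly as follows. This is a minimal illustration, not the Agent's actual code: `collect_context`, `build_prompt`, and the `Enricher` callable type are hypothetical names, and the `"ok"`/`"unavailable"` states are borrowed from the `capabilities` field of the `/analyze` response. The sketch assumes that an enricher which is not configured, or which fails, is recorded as unavailable rather than aborting the analysis.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical enricher type: takes the alert and returns evidence text,
# or raises if the backend cannot be reached.
Enricher = Callable[[dict], str]


def collect_context(
    alert: dict,
    baseline: Enricher,
    enrichers: Dict[str, Optional[Enricher]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Collect baseline evidence, then try each optional enricher.

    Returns (evidence, capabilities). An enricher that is not configured
    (None) or that raises is marked "unavailable" instead of failing the run.
    """
    evidence = {"k8s_core": baseline(alert)}
    capabilities = {"k8s_core": "ok"}
    for name, enricher in enrichers.items():
        if enricher is None:
            capabilities[name] = "unavailable"
            continue
        try:
            evidence[name] = enricher(alert)
            capabilities[name] = "ok"
        except Exception:
            # Degrade gracefully: keep going with whatever evidence exists.
            capabilities[name] = "unavailable"
    return evidence, capabilities


def build_prompt(alert: dict, evidence: Dict[str, str], capabilities: Dict[str, str]) -> str:
    """Build a prompt that states which evidence sources were usable."""
    lines = [f"Alert: {alert.get('labels', {}).get('alertname', 'unknown')}"]
    for name, state in capabilities.items():
        lines.append(f"[{name}: {state}]")
        if name in evidence:
            lines.append(evidence[name])
    return "\n".join(lines)
```

Listing the unavailable sources in the prompt lets the model distinguish "no evidence of a problem" from "evidence was never collected".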
| Category | Technology |
|---|---|
| Language | Python 3.10+ |
| Framework | FastAPI |
| AI/LLM | Strands Agents (Gemini/OpenAI/Anthropic) |
| Package Manager | uv |
| Linting | ruff |
| Testing | pytest |
| Container | Docker |
| CI/CD | GitHub Actions |
- Python 3.10+
- uv (Python package manager)
- (Optional) Kubernetes cluster access
- (Optional) AI provider API key
# Run in repository root
# (monorepo layout: cd agent/main)
make install
# or manually:
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
make run
# or manually:
uvicorn app.main:app --host 0.0.0.0 --port 8000
The server starts at http://localhost:8000.
make test
# or:
pytest
make lint # Check code style
make format # Auto-format code

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Service info |
| GET | `/ping` | Health check |
| GET | `/healthz` | Kubernetes health probe |
| POST | `/analyze` | Analyze single alert |
| POST | `/summarize-incident` | Summarize resolved incident |
| GET | `/openapi.json` | OpenAPI specification |

`POST /analyze` analyzes a single alert with Kubernetes baseline context and optional observability enrichers.
Request:
{
"alert": {
"status": "firing",
"labels": {
"alertname": "HighMemoryUsage",
"severity": "critical",
"namespace": "default",
"pod": "example-pod"
},
"annotations": {
"summary": "High memory usage detected",
"description": "Pod memory usage > 90%"
},
"startsAt": "2024-01-01T00:00:00Z",
"fingerprint": "abc123"
},
"thread_ts": "1234567890.123456"
}
Response:
{
"status": "ok",
"thread_ts": "1234567890.123456",
"analysis": "## Root Cause Analysis\n...",
"analysis_summary": "Brief summary of the issue",
"analysis_detail": "Detailed RCA markdown content...",
"analysis_quality": "medium",
"missing_data": ["alert.labels.pod"],
"warnings": ["namespace/pod_name missing from alert labels"],
"capabilities": {
"k8s_core": "ok",
"manifest_read": "ok",
"prometheus": "unavailable",
"tempo": "unavailable",
"mesh": "unknown",
"traffic_policy": "unknown"
},
"context": {
"namespace": "default",
"pod_name": "example-pod",
"analysis_quality": "medium"
},
"artifacts": []
}
`POST /summarize-incident` summarizes a resolved incident with all associated alerts.
The analysis engine can inspect namespaced Kubernetes manifests (core and CRD) with:
- `get_manifest(namespace, api_version, resource, name)`
- `list_manifests(namespace, api_version, resource, label_selector=None, limit=20)`
Examples:
get_manifest("bookinfo", "v1", "services", "reviews")
get_manifest("bookinfo", "networking.istio.io/v1", "virtualservices", "reviews-route")
list_manifests("bookinfo", "v1", "configmaps", "app=reviews", 10)
Notes:
- `api_version` supports both core (`v1`) and grouped (`group/version`) formats.
- `resource` must be a plural resource name (for example: `pods`, `services`, `virtualservices`).
- For security and readability, secret values are masked and `status`/`metadata.managedFields` are omitted in `get_manifest` responses.
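The notes above can be illustrated with a small sketch. It is an assumption-laden example rather than the Agent's implementation: `split_api_version` and `sanitize_manifest` are hypothetical helper names, the `"***"` placeholder is invented for illustration, and the sketch reads the note as omitting both `status` and `metadata.managedFields`.

```python
import copy
from typing import Tuple


def split_api_version(api_version: str) -> Tuple[str, str]:
    """Split "v1" into ("", "v1") and "group/version" into (group, version)."""
    group, sep, version = api_version.rpartition("/")
    return (group, version) if sep else ("", api_version)


def sanitize_manifest(manifest: dict) -> dict:
    """Return a copy with noisy fields dropped and Secret values masked."""
    cleaned = copy.deepcopy(manifest)  # never mutate the caller's object
    cleaned.pop("status", None)
    metadata = cleaned.get("metadata")
    if isinstance(metadata, dict):
        metadata.pop("managedFields", None)
    if cleaned.get("kind") == "Secret":
        for field in ("data", "stringData"):
            if isinstance(cleaned.get(field), dict):
                # Keep the keys so the shape stays visible, hide the values.
                cleaned[field] = {key: "***" for key in cleaned[field]}
    return cleaned
```

For example, `split_api_version("networking.istio.io/v1")` yields `("networking.istio.io", "v1")`, while the core `"v1"` yields an empty group.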

| Variable | Description | Default |
|---|---|---|
| `AI_PROVIDER` | LLM provider (`gemini`, `openai`, `anthropic`) | gemini |
| `GEMINI_API_KEY` | Gemini API key for Strands Agents | - |
| `OPENAI_API_KEY` | OpenAI API key for Strands Agents | - |
| `ANTHROPIC_API_KEY` | Anthropic API key for Strands Agents | - |
| `GEMINI_MODEL_ID` | Gemini model ID | gemini-3-flash-preview |
| `OPENAI_MODEL_ID` | OpenAI model ID | gpt-4o |
| `ANTHROPIC_MODEL_ID` | Anthropic model ID | claude-sonnet-4-20250514 |
| `ANTHROPIC_MAX_TOKENS` | Anthropic max output tokens | 4096 |
| `PROMETHEUS_URL` | Prometheus base URL | - (disabled) |
| `LOG_LEVEL` | Logging level | info |
| `WEB_CONCURRENCY` | Uvicorn worker count | 1 |
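As a rough illustration of how the provider settings in this table fit together, the sketch below resolves the active provider, its model ID, and whether the service would have to fall back to a basic summary. The function name `resolve_provider` and the returned dictionary shape are hypothetical; only the variable names and defaults come from the table.

```python
from typing import Mapping

# Defaults copied from the table above.
DEFAULT_MODEL_IDS = {
    "gemini": "gemini-3-flash-preview",
    "openai": "gpt-4o",
    "anthropic": "claude-sonnet-4-20250514",
}


def resolve_provider(env: Mapping[str, str]) -> dict:
    """Pick the provider, model ID, and API key from environment-style settings."""
    provider = env.get("AI_PROVIDER", "gemini").lower()
    if provider not in DEFAULT_MODEL_IDS:
        raise ValueError(f"unsupported AI_PROVIDER: {provider!r}")
    prefix = provider.upper()
    api_key = env.get(f"{prefix}_API_KEY") or None  # treat "" as unset
    return {
        "provider": provider,
        "model_id": env.get(f"{prefix}_MODEL_ID", DEFAULT_MODEL_IDS[provider]),
        "api_key": api_key,
        # Assumption: no key for the selected provider means fallback mode.
        "fallback": api_key is None,
    }
```

In a running service this would typically be called as `resolve_provider(os.environ)`; passing a mapping keeps the function easy to test.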

| Variable | Description | Default |
|---|---|---|
| `K8S_API_TIMEOUT_SECONDS` | K8s API timeout | 5 |
| `K8S_EVENT_LIMIT` | Max events to fetch | 25 |
| `K8S_LOG_TAIL_LINES` | Log lines to fetch | 25 |

| Variable | Description | Default |
|---|---|---|
| `PROMETHEUS_URL` | Prometheus base URL | - |
| `PROMETHEUS_HTTP_TIMEOUT_SECONDS` | HTTP timeout | 5 |

| Variable | Description | Default |
|---|---|---|
| `LOKI_URL` | Loki base URL | - |
| `LOKI_HTTP_TIMEOUT_SECONDS` | Loki HTTP timeout | 10 |
| `LOKI_TENANT_ID` | Loki tenant header value (`X-Scope-OrgID`) | - |

| Variable | Description | Default |
|---|---|---|
| `TEMPO_URL` | Tempo base URL (e.g. `http://tempo.monitoring.svc:3100`) | - |
| `TEMPO_HTTP_TIMEOUT_SECONDS` | Tempo HTTP timeout | 10 |
| `TEMPO_TENANT_ID` | Tempo tenant header value (`X-Scope-OrgID`) | - |
| `TEMPO_TRACE_LIMIT` | Max traces fetched per alert | 5 |
| `TEMPO_LOOKBACK_MINUTES` | Minutes before `startsAt` for trace search window | 15 |
| `TEMPO_FORWARD_MINUTES` | Minutes after `startsAt` for trace search window | 5 |
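The lookback and forward settings define a time window around the alert's `startsAt`. A minimal sketch of that arithmetic follows; `trace_search_window` is a hypothetical helper name, and treating a timestamp without a timezone as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def trace_search_window(
    starts_at: str,
    lookback_minutes: int = 15,
    forward_minutes: int = 5,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the trace search window around startsAt."""
    # datetime.fromisoformat() on Python 3.10 rejects a trailing "Z",
    # so normalize it to an explicit UTC offset first.
    started = datetime.fromisoformat(starts_at.replace("Z", "+00:00"))
    if started.tzinfo is None:
        started = started.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return (
        started - timedelta(minutes=lookback_minutes),
        started + timedelta(minutes=forward_minutes),
    )
```

With the defaults, an alert starting at `2024-01-01T00:00:00Z` is searched from 23:45 on the previous day to 00:05, a 20-minute window.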

| Variable | Description | Default |
|---|---|---|
| `PROMPT_TOKEN_BUDGET` | Approximate token budget | 32000 |
| `PROMPT_MAX_LOG_LINES` | Max log lines in prompt | 25 |
| `PROMPT_MAX_EVENTS` | Max events in prompt | 25 |
| `PROMPT_SUMMARY_MAX_ITEMS` | Max session summaries | 3 |
| `MASKING_REGEX_LIST_JSON` | JSON array of regex patterns for masking before LLM/DB response flows | [] |
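To make `MASKING_REGEX_LIST_JSON` and `PROMPT_MAX_LOG_LINES` more concrete, here is a minimal sketch of regex masking and log trimming. The helper names and the `[MASKED]` placeholder are hypothetical; the Agent's actual replacement text and ordering of these steps are not specified in this document.

```python
import json
import re
from typing import List, Pattern


def load_masking_patterns(raw: str = "[]") -> List[Pattern[str]]:
    """Parse a JSON array of regex strings into compiled patterns."""
    patterns = json.loads(raw)
    if not isinstance(patterns, list):
        raise ValueError("MASKING_REGEX_LIST_JSON must be a JSON array")
    return [re.compile(pattern) for pattern in patterns]


def mask_text(text: str, patterns: List[Pattern[str]], placeholder: str = "[MASKED]") -> str:
    """Replace every match of every pattern with the placeholder."""
    for pattern in patterns:
        text = pattern.sub(placeholder, text)
    return text


def tail_lines(log_text: str, max_lines: int = 25) -> List[str]:
    """Keep only the last max_lines lines of a log."""
    lines = log_text.splitlines()
    return lines[-max_lines:] if max_lines > 0 else []
```

Usage might look like `mask_text(line, load_masking_patterns('["token=[a-z0-9]+"]'))`, which turns `auth token=abc123 ok` into `auth [MASKED] ok`.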

| Variable | Description | Default |
|---|---|---|
| `LLM_RETRY_MAX_ATTEMPTS` | Max retry attempts for transient LLM API errors (5xx, 429) | 5 |
| `LLM_RETRY_MIN_WAIT` | Minimum backoff wait time in seconds | 1.0 |
| `LLM_RETRY_MAX_WAIT` | Maximum backoff wait time in seconds | 60.0 |
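One plausible reading of these three settings is capped exponential backoff, sketched below. The table only says "backoff", so the doubling schedule is an assumption, as is counting the first call toward the attempt limit. `TransientLLMError`, `backoff_delays`, and `call_with_retry` are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientLLMError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 5xx or 429."""


def backoff_delays(max_attempts: int = 5, min_wait: float = 1.0, max_wait: float = 60.0) -> List[float]:
    """Waits between attempts: min_wait doubled each time, capped at max_wait."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return [min(max_wait, min_wait * (2 ** i)) for i in range(max_attempts - 1)]


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    min_wait: float = 1.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientLLMError until attempts run out."""
    delays = backoff_delays(max_attempts, min_wait, max_wait)
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Under these assumptions the default settings wait 1, 2, 4, and 8 seconds between the five attempts; the 60-second cap only matters with a higher attempt count.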

| Variable | Description |
|---|---|
| `SESSION_DB_HOST` | PostgreSQL host |
| `SESSION_DB_PORT` | PostgreSQL port |
| `SESSION_DB_NAME` | Database name |
| `SESSION_DB_USER` | Database user |
| `SESSION_DB_PASSWORD` | Database password |

If `GEMINI_API_KEY`/`OPENAI_API_KEY`/`ANTHROPIC_API_KEY` is set, configure `SESSION_DB_*` together.
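A startup check for the rule above could look like the following sketch. `missing_session_settings` is a hypothetical helper, and whether the Agent actually validates this at startup is not stated here.

```python
from typing import List, Mapping

PROVIDER_KEY_VARS = ("GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")
SESSION_DB_VARS = (
    "SESSION_DB_HOST",
    "SESSION_DB_PORT",
    "SESSION_DB_NAME",
    "SESSION_DB_USER",
    "SESSION_DB_PASSWORD",
)


def missing_session_settings(env: Mapping[str, str]) -> List[str]:
    """List the SESSION_DB_* variables still needed once a provider key is set."""
    if not any(env.get(name) for name in PROVIDER_KEY_VARS):
        return []  # no provider key: session storage is not required
    return [name for name in SESSION_DB_VARS if not env.get(name)]
```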
agent/
├── app/
│ ├── main.py # FastAPI entrypoint
│ ├── api/
│ │ ├── analysis.py # POST /analyze, POST /summarize-incident
│ │ └── health.py # GET /, /ping, /healthz
│ ├── clients/
│ │ ├── k8s.py
│ │ ├── prometheus.py
│ │ ├── tempo.py
│ │ ├── session_repository.py
│ │ ├── summary_store.py
│ │ ├── strands_agent.py
│ │ ├── strands_patch.py
│ │ └── llm_providers/
│ ├── core/
│ │ ├── config.py
│ │ ├── dependencies.py
│ │ └── logging.py
│ ├── models/
│ ├── schemas/
│ │ ├── alert.py
│ │ └── analysis.py
│ └── services/analysis.py
├── docs/openapi.json
├── scripts/export_openapi.py
├── tests/
├── Dockerfile
├── Makefile
└── pyproject.toml

| Command | Description |
|---|---|
| `make install` | Install dependencies |
| `make run` | Run development server |
| `make lint` | Run ruff linter |
| `make format` | Format code with ruff |
| `make test` | Run pytest |
| `make build IMAGE=<tag>` | Build Docker image |
| `make curl-analyze` | Test analyze endpoint |
| `make curl-analyze-local` | Test with local server |

When the API changes, regenerate the OpenAPI spec:
uv run python scripts/export_openapi.py
The spec is saved to docs/openapi.json.
Auto-regenerate OpenAPI on commit:
git config core.hooksPath .githooks
release-please parses conventional commits; merge commits that include the PR title can be double-counted in the changelog.
- Prefer `Squash and merge` or `Rebase and merge`.
- If `Create a merge commit` is used, keep the PR title non-conventional (e.g., "Merge PR #123").
- Use Conventional Commits for change commits that should appear in the changelog.
make test
# or:
pytest tests/
Requires a Kubernetes cluster and provider API key:
AI_PROVIDER=gemini GEMINI_API_KEY=xxx KUBECONFIG=~/.kube/config make test-analysis-local
curl -X POST http://localhost:8000/analyze \
-H 'Content-Type: application/json' \
-d '{
"alert": {
"status": "firing",
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"namespace": "default",
"pod": "example-pod"
},
"annotations": {
"summary": "Test summary",
"description": "Test description"
},
"startsAt": "2024-01-01T00:00:00Z",
"fingerprint": "test-fingerprint"
},
"thread_ts": "test-thread"
}'
docker build -t kube-rca-agent .
# or:
make build IMAGE=kube-rca-agent
docker run -d -p 8000:8000 \
-e GEMINI_API_KEY=your-api-key \
kube-rca-agent
Test the agent with a real OOMKilled scenario in Kubernetes.
- `kubectl` with cluster access
- `kube-rca` namespace exists
# Create OOM pod only
make test-oom-only
# Full test with analysis
GEMINI_API_KEY=xxx make test-analysis-local
# Cleanup
make cleanup-oom

| Variable | Description | Default |
|---|---|---|
| `KUBE_CONTEXT` | Kubernetes context | current |
| `LOCAL_OOM_NAMESPACE` | Test namespace | kube-rca |
| `LOCAL_OOM_DEPLOYMENT` | Deployment name | oomkilled-test |
| `LOCAL_OOM_MEMORY_LIMIT` | Memory limit | 64Mi |
| `CLEANUP` | Auto-cleanup after test | false |

When the API key for the configured provider (for example, `GEMINI_API_KEY`) is not set, the agent returns a fallback summary:
{
"status": "ok",
"analysis": {
"summary": "Alert received but AI analysis unavailable",
"detail": "Basic alert information..."
}
}
- KubeRCA Backend - Go REST API server
- KubeRCA Frontend - React web dashboard
- Helm Charts - Kubernetes deployment
- Chaos Scenarios - Failure injection tests
This project is part of KubeRCA, licensed under the MIT License. See the LICENSE file for details.
