AI-powered Kubernetes incident analysis and Root Cause Analysis
Turn Kubernetes alerts into actionable root-cause analysis — in seconds, not hours.
⭐ If KubeRCA is useful to you, please consider giving the repo a star. It helps the project reach more operators and brings in more contributors.
A few screens from a running KubeRCA install. The dashboards correlate alerts to incidents, and every incident gets an LLM-generated RCA summary that lands in Slack and the UI together.
Centralized view of all incidents — search, filter by severity / status, and export to CSV.
Every Alertmanager alert is correlated to its parent incident, with namespace and severity context attached.
Each firing alert posts to Slack, and the LLM-generated RCA summary (root cause, impact, recommended action) is added in the same thread automatically.
Each incident exposes a structured analysis: root cause, impact scope, mitigation steps taken, preventive recommendations, and related alerts pulled from the embedding-based similar-incident search.
Each alert tracks its own analysis lifecycle (firing → resolved), with the LLM's summary, evidence, recommended remediation, and operator feedback in one place.
KubeRCA is an open-source tool that turns Kubernetes alerts into actionable incident context, AI-assisted analysis, and operator workflows.
It is built for the gap between "an alert fired" and "we understand what happened." In many Kubernetes environments, operators still have to gather logs, metrics, events, traces, and past incident notes by hand before they can even start reasoning about root cause. KubeRCA shortens that loop by connecting alert intake, context collection, RCA generation, Slack thread delivery, and dashboard workflows into one system.
- Context collection is slow: teams lose time jumping between Kubernetes, monitoring tools, Slack, and dashboards after every alert.
- RCA quality varies by operator: the initial hypothesis, evidence gathered, and explanation format often depend on who is on call.
- Past incidents are hard to reuse: similar failures may have already happened, but the response knowledge is rarely searchable in a structured way.
KubeRCA addresses those gaps by capturing incident data as it happens, generating explainable RCA summaries, and making related incidents, feedback, and follow-up discussion part of the same workflow.
KubeRCA is a strong fit when you already operate Kubernetes with Alertmanager and want faster incident triage, more consistent RCA, and a searchable incident history.
- Teams receiving production or staging alerts through Alertmanager.
- Operators who use Slack threads or a web dashboard as their incident working surface.
- Environments where similar incidents recur and historical reuse is valuable.
- Platforms that want LLM-assisted triage without replacing their current observability stack.
It is a weaker fit for:
- Log-only workflows without structured alerts.
- Teams looking for a generic APM replacement rather than incident-focused RCA assistance.
- Organizations that want a fully autonomous remediation engine instead of operator-in-the-loop analysis.
flowchart TD
  AM[Alertmanager]
  SL[Slack]
  LLM[LLM Provider]
  K8S[Kubernetes API]
  PR[Prometheus]
  TP[Tempo]
  subgraph KubeRCA
    FE[Frontend]
    BE[Backend]
    AG[Agent]
    DB[(PostgreSQL + pgvector)]
  end
  AM -->|Webhook| BE
  FE <-->|REST + SSE| BE
  BE -->|Analyze / Summarize / Chat| AG
  BE -->|Thread notifications| SL
  BE <-->|Incidents / alerts / embeddings| DB
  AG -->|Cluster context| K8S
  AG -->|Metrics| PR
  AG -.->|Trace context| TP
  AG -->|Inference| LLM
| Stage | What happens |
|---|---|
| Alert intake | Alertmanager sends alerts to the Backend via POST /webhook/alertmanager. |
| Incident creation | Backend creates or updates incidents, stores alerts, and tracks Slack thread metadata. |
| RCA generation | Backend calls the Agent, which collects Kubernetes and observability context and runs LLM analysis. |
| Team visibility | Backend publishes results to Slack threads and streams updates to the Frontend over SSE. |
| Resolution workflows | Operators can resolve incidents, manually resolve alerts, search similar incidents, leave feedback, and use in-app chat. |
| Knowledge reuse | Incident summaries and embeddings are stored in PostgreSQL + pgvector for later search and review. |
For the full runtime sequence and API surface, see Architecture Details and the diagrams under docs/diagrams.
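The "Team visibility" stage relies on Server-Sent Events. As a rough illustration of what a client has to do with such a stream, the sketch below parses SSE frames using the standard wire format (`event:` and `data:` fields, a blank line ending each event). The event name and JSON shape are hypothetical and not taken from the KubeRCA API; see Architecture Details for the real endpoints.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format allows one optional leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: the event name and payload are illustrative only.
sample = [
    ": keep-alive\n",
    "event: incident_updated\n",
    'data: {"id": "inc-1", "status": "resolved"}\n',
    "\n",
]
events = list(parse_sse(sample))
```

A client that cannot hold the stream open could fall back to polling the REST API on a timer, which matches the polling fallback the dashboard is described as using.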
- Receive alerts through Alertmanager webhook integration and map them into incidents and alerts automatically.
- Collect Kubernetes, Prometheus, and Tempo context around the affected workload.
- Run RCA with Strands Agents using gemini, openai, or anthropic.
- Publish incident updates and RCA summaries into threaded Slack conversations.
- Stream incident and alert state changes to the dashboard through SSE with polling fallback.
- Support manual alert resolve, feedback comments and votes, and context-aware chat.
- Summarize resolved incidents and store embeddings in PostgreSQL + pgvector.
- Search for similar past incidents to reuse investigation patterns and responses.
- Keep incident, alert, feedback, and webhook routing data in one system of record.
- Deploy the full stack with the kube-rca Helm chart.
- Use local auth by default or enable Google OIDC for SSO-style access.
- Run on top of your existing Kubernetes and monitoring setup without replacing it.
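To make the similar-incident search in the list above more concrete, here is a minimal sketch of ranking stored incident embeddings by cosine similarity. It is not KubeRCA's implementation: the incident IDs and three-dimensional vectors are made up, and the actual distance metric used by the Backend is not stated in this README. In PostgreSQL, pgvector's `<=>` operator computes cosine distance (1 minus cosine similarity), so the equivalent query would order by that operator instead of scoring in application code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_similar(
    query: Sequence[float], incidents: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k incident IDs whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query, vec), incident_id) for incident_id, vec in incidents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(incident_id, round(score, 3)) for score, incident_id in scored[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
past = {
    "inc-oom": [0.9, 0.1, 0.0],
    "inc-dns": [0.0, 0.2, 0.9],
    "inc-crashloop": [0.8, 0.3, 0.1],
}
result = top_similar([1.0, 0.2, 0.0], past, k=2)
```

With these toy vectors the two memory-related incidents rank ahead of the DNS one, which is the behavior an operator would expect from a "have we seen this before?" lookup.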
Use this path if you want to evaluate the system quickly rather than read all documentation first.
helm upgrade --install kube-rca oci://public.ecr.aws/r5b7j2e4/kube-rca-ecr/charts/kube-rca \
  --namespace kube-rca --create-namespace \
  -f values.yaml

Minimal values.yaml example:
postgresql:
  auth:
    existingSecret: ""
    password: "change-me"
backend:
  slack:
    enabled: false
  postgresql:
    secret:
      existingSecret: ""
  embedding:
    apiKey:
      existingSecret: ""
agent:
  aiProvider: "gemini"
  gemini:
    apiKey: "YOUR_GEMINI_API_KEY"
    secret:
      existingSecret: ""
frontend:
  ingress:
    enabled: true
    hosts:
      - "kube-rca.example.com"

Alertmanager receiver example:

receivers:
  - name: "kube-rca"
    webhook_configs:
      - url: "http://kube-rca-backend.kube-rca.svc.cluster.local:8080/webhook/alertmanager"
        send_resolved: true
route:
  receiver: "kube-rca"

- Trigger or forward a real alert from your cluster.
- Verify that the Backend stores the alert, the Agent returns analysis, and the dashboard updates in real time.
- If Slack is enabled, confirm that the incident and RCA are posted in the same thread.
- Resolve an incident and verify incident summarization and similarity search.
- Try manual alert resolve for a case where Alertmanager resolution may be delayed.
- Open the chat panel or feedback flow to see how operators can refine or discuss the analysis.
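If no real alert is handy for the first step of the walkthrough above, a synthetic one can stand in. The sketch below builds a payload in the shape Alertmanager uses for webhook receivers (version 4) and posts it to the `/webhook/alertmanager` path. The alert name, namespace, and `localhost:8080` URL are placeholders, the field set is a subset of what Alertmanager sends, and whether the Backend requires extra fields or authentication on this endpoint is not covered here, so treat it as a starting point rather than a verified test client.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_payload(alertname: str, namespace: str, severity: str, status: str = "firing") -> dict:
    """Build a minimal Alertmanager-style webhook payload with a single alert."""
    labels = {"alertname": alertname, "namespace": namespace, "severity": severity}
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "status": status,
        "receiver": "kube-rca",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "",
        "alerts": [
            {
                "status": status,
                "labels": labels,
                "annotations": {"summary": "synthetic test alert"},
                "startsAt": datetime.now(timezone.utc).isoformat(),
                "endsAt": "0001-01-01T00:00:00Z",  # zero time: not yet resolved
                "generatorURL": "",
            }
        ],
    }


def send(url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


payload = build_payload("KubePodCrashLooping", "demo", "critical")
# Assumes the Backend is reachable locally, e.g. through a port-forward:
# send("http://localhost:8080/webhook/alertmanager", payload)
```

Sending the same payload again with `status="resolved"` is one way to exercise the resolve path without waiting for Alertmanager.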
Need step-by-step setup? See the full Helm Chart README and the installation guide:
- Korean (한국어): Installation Guide
- English: Installation Guide
Both walk through the first scenario end-to-end.
| Area | Default / Supported | Notes |
|---|---|---|
| Alert source | Alertmanager | Primary ingestion path for incidents and alerts |
| Notification | Slack | Optional but strongly aligned with threaded incident workflows |
| AI providers | Gemini, OpenAI, Anthropic | Selected through agent.aiProvider |
| Cluster context | Kubernetes API | Core runtime evidence source |
| Observability enrichers | Prometheus, Tempo | Additional signal sources when configured |
| Database | PostgreSQL + pgvector | Incident, feedback, and embedding storage |
| Auth | Local auth, Google OIDC | OIDC is optional |
| Deployment | Helm, ArgoCD-oriented usage | The chart is the main deployment entrypoint |
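Because the provider is picked through `agent.aiProvider`, a typo there is an easy way to end up with an Agent that cannot run analysis. The following is a small pre-install check, written as a sketch: it assumes the `openai` and `anthropic` sections mirror the `gemini` layout from the minimal values.yaml example (an inline `apiKey` or a `secret.existingSecret` reference), which should be confirmed against the chart README before relying on it.

```python
from typing import Dict, List

SUPPORTED_PROVIDERS = ("gemini", "openai", "anthropic")


def check_agent_values(values: Dict) -> List[str]:
    """Return a list of human-readable problems found in the agent section."""
    problems: List[str] = []
    agent = values.get("agent") or {}
    provider = agent.get("aiProvider")
    if provider not in SUPPORTED_PROVIDERS:
        problems.append(
            f"agent.aiProvider must be one of {', '.join(SUPPORTED_PROVIDERS)}, got {provider!r}"
        )
        return problems
    # Assumed layout: agent.<provider>.apiKey or agent.<provider>.secret.existingSecret
    section = agent.get(provider) or {}
    has_inline_key = bool(section.get("apiKey"))
    has_secret_ref = bool((section.get("secret") or {}).get("existingSecret"))
    if not (has_inline_key or has_secret_ref):
        problems.append(f"agent.{provider} needs an apiKey or a secret.existingSecret")
    return problems


ok = check_agent_values(
    {"agent": {"aiProvider": "gemini", "gemini": {"apiKey": "YOUR_GEMINI_API_KEY"}}}
)
bad = check_agent_values({"agent": {"aiProvider": "llama"}})
```

The values dictionary could come from loading values.yaml with any YAML parser; the check itself only depends on plain dictionaries.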
.
├── backend/ Go API for auth, incidents, alerts, embeddings, feedback, chat, and SSE
├── frontend/ React dashboard for incident operations and realtime views
├── agent/ FastAPI analysis service for RCA, incident summaries, and chat
├── charts/ Helm chart for deploying KubeRCA into Kubernetes
├── chaos/ Chaos Mesh scenarios and helper scripts for failure injection
└── docs/ Architecture, guides, diagrams, and assets
# Run each command from the repository root.
# Backend
(cd backend && go test ./...)
# Agent
(cd agent && make install && make test)
# Frontend
(cd frontend && npm ci && npm run dev)
# Helm
helm repo add bitnami https://charts.bitnami.com/bitnami --force-update
helm dependency build charts/kube-rca
helm lint charts/kube-rca

No. Slack is optional. You can start with the dashboard-only flow and enable Slack later if threaded notifications fit your incident process.
No. KubeRCA can start with Kubernetes alerts and the core cluster context path. Additional observability backends improve analysis depth when they are available.
Yes. The chart supports PostgreSQL configuration beyond the bundled dependency. See the chart README for the installation-specific values.
The project examples use Gemini for the quickest path, but OpenAI and Anthropic are also supported through the same provider abstraction.
Yes. Local auth works without OIDC. Google OIDC is optional and can be enabled later when you want centralized login.
Currently, KubeRCA supports Korean and English.
Issues, pull requests, and design feedback are all welcome. Before opening a PR, please read:
- CONTRIBUTING.md — development setup per component, Conventional Commits, DCO sign-off, PR workflow.
- CODE_OF_CONDUCT.md — community expectations.
- GOVERNANCE.md — roles, decision making, and how Maintainers are added.
- SECURITY.md — how to report vulnerabilities privately.
If you change behavior across backend, agent, frontend, or Helm values, keep the documentation in docs/ and component READMEs aligned with the implementation.
This project is licensed under the Apache License, Version 2.0. See LICENSE and NOTICE for details.





