Find wasted Kubernetes spend before it lands on the bill.
Kopilot is an approval-gated AI operator for Kubernetes teams. It investigates over-provisioned workloads, idle resources, and orphaned storage from one prompt, explains the evidence in plain English, and recommends the next safe action.
Public CLI note: install the Python package as `kubedevaiops`, then run the branded `kopilot` command. Internal package paths and Kubernetes API groups still use `kubedevaiops` during the transition.
Built for
- Platform and DevOps teams managing noisy multi-namespace clusters
- Operators who want self-hosted automation with audit logs
- Teams that want recommendations first and approval before risky cleanup
```bash
pip install kubedevaiops
kopilot ask "Find over-provisioned deployments and orphaned PVCs across all namespaces"
```

What the first run should give you:
- Evidence from cluster metrics and live object inspection
- Plain-English explanation of where waste is coming from
- Rightsizing and cleanup recommendations, with destructive follow-up left behind approval gates
- Read-first by default: investigations start with `get`, `describe`, `logs`, and metrics collection before any mutation is considered.
- Approval-gated changes: destructive actions such as deletes, drains, and cordons require explicit approval.
- Protected namespaces: high-risk namespaces are blocked from destructive actions.
- Audit trail: every delegated task and executed command is recorded.
- Self-hosted posture: the core deployment model assumes you control the cluster access path.
| Use case | Prompt | Expected outcome |
|---|---|---|
| Rightsizing workloads | Find over-provisioned deployments across all namespaces | Highlights requests vs observed usage and suggests safer request targets |
| Idle resource review | Show me resources that look idle and worth review | Flags low-usage workloads without auto-deleting them |
| Orphaned storage | List PVCs that are not attached to running workloads | Surfaces old storage objects with namespace and age context |
| Safe follow-up plan | Recommend the next cost-saving actions for staging | Produces an ordered plan that still keeps destructive cleanup approval-gated |
Kopilot now exposes one concrete interop surface and two discovery surfaces:

- `kopilot mcp` runs an MCP server for external agent clients
- `GET /skills/portable` exposes enabled skills as portable manifests
- `GET /.well-known/agent-manifest.json` exposes an async-first discovery document

```bash
kopilot mcp --transport stdio
```

Current skill model:
- Built-in YAML skills ship with the package
- Mounted YAML skills can be loaded through `SKILL_DIRS`
- AISkill CRDs are the planned next source and will map to the same portable manifest shape
Agent-to-agent note:
- MCP is the implemented tool-level interoperability layer today
- ACP has been folded into the broader agent-to-agent layer, so Kopilot keeps an async task API and discovery manifest for now instead of claiming full protocol support prematurely
- No hardcoded tools — The LLM constructs any command it needs. The executor middleware provides `run_kubectl`, `run_helm`, `run_shell`, and `read_resource` as generic capabilities.
- Skills = Sub-Agents — Each skill is defined in a YAML file with a domain-specific system prompt and documentation. At runtime, it becomes a fully autonomous ReAct agent that reasons, plans, and executes.
- Dynamic skill loading — Skills are loaded from:
  - Built-in YAML files shipped with the package
  - Extra directories (e.g. ConfigMap volumes in K8s)
  - AISkill CRDs applied to the cluster
- Supervisor pattern — A coordinator agent routes user requests to the best sub-agent(s), allowing multi-domain tasks via parallel delegation.
- Safety-first execution — Every command goes through a safety layer that blocks operations on protected namespaces, requires approval for destructive actions, and audits everything.
- Self-improving agents — Optional reflection loop evaluates response quality and re-plans if the score is below threshold. Feedback is logged for continuous improvement.
- Multi-provider LLM — Supports Ollama (local), Gemini, OpenAI, Azure OpenAI, and Anthropic. Switch providers via a single env var.
- Production observability — Prometheus `/metrics` endpoint, structured audit logging, task history, rate limiting, and execution statistics.
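The reflection loop can be pictured as a simple evaluate-and-replan cycle. The sketch below is illustrative only — `ToyAgent`, `answer_with_reflection`, and the 0.7 threshold are assumptions for demonstration, not Kopilot's actual API:

```python
class ToyAgent:
    """Stub standing in for a Kopilot sub-agent (illustrative only)."""

    def __init__(self):
        self.calls = 0

    def run(self, prompt, feedback=None):
        self.calls += 1
        return f"draft {self.calls}"

    def evaluate(self, prompt, response):
        # Pretend response quality improves each round (0..1 score).
        return 0.4 + 0.2 * self.calls


def answer_with_reflection(agent, prompt, threshold=0.7, max_rounds=3):
    """Re-plan until the self-evaluated score clears the threshold."""
    response = agent.run(prompt)
    while agent.evaluate(prompt, response) < threshold and agent.calls < max_rounds:
        # Below threshold: feed the critique back and try again.
        response = agent.run(prompt, feedback="score below threshold; re-plan")
    return response
```

With a real LLM backing `evaluate`, the feedback string would carry the critique; here the stub just shows the control flow.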
Skills are not Python code. They are YAML definitions that instruct an LLM:
```yaml
name: security
display_name: Security Operations
description: Kubernetes security auditing and hardening
system_prompt: |
  You are an expert Kubernetes security engineer.
  - Audit RBAC roles and bindings for over-privilege
  - Check pod security contexts and standards
  - Evaluate NetworkPolicy coverage
  - Run compliance checks against CIS Benchmarks
  ...
documentation: |
  # Quick Reference
  - List RBAC: kubectl get roles,rolebindings,clusterroles -A
  - Pod security: kubectl get pods -A -o jsonpath='{...securityContext}'
  ...
```

| Skill | Domain | What it can do |
|---|---|---|
| `security` | Security & Compliance | RBAC audit, pod security, network policies, CIS benchmarks |
| `administration` | Cluster Admin | Deployments, scaling, rollouts, Helm, namespaces |
| `networking` | Network Ops | Services, Ingress, DNS, connectivity, service mesh |
| `monitoring` | Observability | Resource usage, HPA, events, Prometheus, logging |
| `troubleshooting` | Incident Response | Root cause analysis, crash diagnostics, scheduling |
| `cost_optimization` | FinOps | Right-sizing, idle detection, orphaned resources |
Drop a YAML file in skills/builtin/ or mount it at runtime:
```yaml
name: my_custom_skill
display_name: My Custom Skill
description: Does something specific to my org
system_prompt: |
  You are an expert in [domain]. You know how to...
documentation: |
  Internal runbook for [domain]...
```

Set `SKILL_DIRS=/path/to/extra/skills` or mount a ConfigMap.
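Discovery across built-in and mounted directories can be sketched roughly as below. The function name `discover_skill_files` is an assumption for illustration, not Kopilot's real loader; the `os.pathsep` split is what gives the `;` on Windows / `:` on Linux behavior:

```python
import os
from pathlib import Path


def discover_skill_files(builtin_dir, env_var="SKILL_DIRS"):
    """Collect skill YAML paths: built-ins first, then extra dirs.

    SKILL_DIRS uses the platform path separator (';' on Windows,
    ':' elsewhere), i.e. os.pathsep.
    """
    search_dirs = [Path(builtin_dir)]
    extra = os.environ.get(env_var, "")
    search_dirs += [Path(d) for d in extra.split(os.pathsep) if d]

    files = []
    for d in search_dirs:
        if d.is_dir():
            # Sketch only matches *.yaml; a real loader might accept *.yml too.
            files.extend(sorted(d.glob("*.yaml")))
    return files
```

Each discovered file would then be parsed (e.g. with `yaml.safe_load`) into the `name` / `system_prompt` / `documentation` shape shown above.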
| Channel | Protocol | Description |
|---|---|---|
| REST API | HTTP/JSON | POST /tasks with natural language prompt |
| Slack | Socket Mode | @mention or DM the bot |
| Webhooks | HTTP/JSON | POST /webhook from ServiceNow, PagerDuty, etc. |
| K8s Events | Watch API | Auto-investigates repeated Warning events |
| CRD | `AITask` | Apply a CRD manifest and the operator handles it |
- Protected namespaces: `kube-system`, `kube-public`, `kube-node-lease` — all destructive operations blocked.
- Approval gates: Delete, drain, cordon operations require explicit approval before execution.
- Risk assessment: Every command is classified (LOW / MEDIUM / HIGH / CRITICAL) before execution.
- Audit trail: Structured logs of every command, delegation, and result.
- Blocked patterns: `rm -rf /`, `dd`, `mkfs` and similar are hard-blocked.
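A pre-flight check combining these guardrails might look like the sketch below. The function name `preflight` and the return values are illustrative, not Kopilot's actual safety-layer API:

```python
import re

PROTECTED_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}
DESTRUCTIVE_VERBS = {"delete", "drain", "cordon"}
HARD_BLOCKED = [r"\brm\s+-rf\s+/", r"\bdd\b", r"\bmkfs\b"]


def preflight(command, namespace):
    """Illustrative pre-flight check: 'blocked', 'needs_approval', or 'allowed'."""
    # Hard-blocked shell patterns are rejected outright.
    if any(re.search(p, command) for p in HARD_BLOCKED):
        return "blocked"

    parts = command.split()
    verb = parts[1] if len(parts) > 1 and parts[0] == "kubectl" else ""
    if verb in DESTRUCTIVE_VERBS:
        if namespace in PROTECTED_NAMESPACES:
            return "blocked"        # destructive ops never touch protected ns
        return "needs_approval"     # approval gate for destructive actions
    return "allowed"
```

A real risk engine would also assign the LOW / MEDIUM / HIGH / CRITICAL classification before execution; this sketch only shows the gating decision.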
- Python 3.11+ (tested on 3.13)
- One of: Ollama (local), Gemini API key, or OpenAI API key
- kubectl configured with cluster access
- Optional: Podman or Docker for container builds
```bash
cd kopilot
pip install -e ".[dev]"
cp .env.example .env   # Edit with your LLM provider settings
kopilot serve --port 8080
```

```bash
export LLM_PROVIDER=gemini
export GEMINI_API_KEY=your-api-key
export GEMINI_MODEL=gemini-2.5-flash
kopilot serve
```

```bash
ollama pull qwen3:8b
export LLM_PROVIDER=ollama
export LLM_MODEL=qwen3:8b
kopilot serve
```

```bash
curl -X POST http://localhost:8080/tasks \
  -H "Content-Type: application/json" \
  -d '{"prompt": "List all nodes and check their health"}'
```

```bash
curl -X POST http://localhost:8080/tasks \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Audit RBAC for over-privileged accounts", "reflect": true}'
```

```bash
kopilot ask "What pods are running in kube-system?"
```

All settings via environment variables or `.env` file:
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `ollama` | `ollama`, `gemini`, `openai`, `azure_openai`, `anthropic` |
| `LLM_MODEL` | `gpt-oss:20b` | Model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
| `GEMINI_API_KEY` | (empty) | Google Gemini API key |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Gemini model name |
| `SAFETY_PROTECTED_NAMESPACES` | `kube-system,...` | JSON list |
| `SAFETY_REQUIRE_APPROVAL_DESTRUCTIVE` | `true` | Gate destructive ops |
| `SAFETY_MAX_CONCURRENT_TASKS` | `5` | Concurrent task limit |
| `ENABLED_SKILLS` | all 6 built-ins | JSON list of skill names |
| `SKILL_DIRS` | (empty) | Extra dirs (`;`-separated on Windows, `:`-separated on Linux) |
| `LOG_FORMAT` | `json` | `json` or `console` |
| `METRICS_ENABLED` | `true` | Enable `/metrics` endpoint |
| `METRICS_PORT` | `9090` | Prometheus metrics port |
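Putting a few of these together, a minimal `.env` for a local Ollama setup might look like this (values are illustrative, not required defaults):

```
LLM_PROVIDER=ollama
LLM_MODEL=gpt-oss:20b
OLLAMA_BASE_URL=http://localhost:11434
SAFETY_REQUIRE_APPROVAL_DESTRUCTIVE=true
LOG_FORMAT=console
METRICS_ENABLED=true
```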
```bash
helm install kubedevaiops ./helm/kubedevaiops \
  --set llm.ollamaUrl=http://ollama:11434 \
  --set llm.model="gpt-oss:20b"
```

```bash
kubectl apply -f helm/kubedevaiops/crds/
kubectl apply -f deploy/quickstart.yaml
```

```yaml
apiVersion: kubedevaiops.io/v1alpha1
kind: AITask
metadata:
  name: security-audit
spec:
  prompt: "Audit RBAC and pod security across all namespaces"
```

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with skills count and LLM provider |
| `/readyz` | GET | Readiness probe |
| `/skills` | GET | List loaded skills |
| `/tasks` | POST | Submit a task (accepts `prompt`, `reflect`, `namespace`) |
| `/tasks/history` | GET | Recent task history (configurable `?limit=`) |
| `/webhook` | POST | Webhook handler for external integrations |
| `/metrics` | GET | Prometheus-compatible metrics |
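The `/tasks` endpoint can be driven from Python with nothing but the standard library. The helper names below are illustrative, and `base_url` assumes a local `kopilot serve`:

```python
import json
import urllib.request


def build_task_request(prompt, base_url="http://localhost:8080", reflect=False):
    """Build a POST /tasks request with a JSON body."""
    body = json.dumps({"prompt": prompt, "reflect": reflect}).encode()
    return urllib.request.Request(
        f"{base_url}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(prompt, **kwargs):
    """Send the task and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_task_request(prompt, **kwargs)) as resp:
        return json.load(resp)
```

For example, `submit_task("Audit RBAC for over-privileged accounts", reflect=True)` mirrors the `curl` call shown in the quick-start section.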
```bash
make dev            # Install with dev dependencies (editable)
make test           # Run pytest with coverage
make lint           # Run ruff linter
make format         # Auto-format code
make run            # Start the full agent locally
make build-podman   # Build container image with Podman
```

```bash
podman machine start
bash scripts/setup-local-k8s.sh
kubectl apply -f helm/kubedevaiops/crds/
```

```bash
# Unit tests only (fast, no external deps)
pytest tests/ -v --ignore=tests/test_integration_smoke.py

# Full suite including integration tests (requires Ollama + K8s)
pytest tests/ -v

# End-to-end smoke test with specific provider
python scripts/smoke_e2e.py gemini
python scripts/smoke_e2e.py ollama
```

```
src/kubedevaiops/
├── agent/
│   ├── supervisor.py      # Main coordinator with reflection loop
│   ├── subagent.py        # Factory: builds agent from YAML skill definition
│   ├── llm.py             # Multi-provider LLM factory (Ollama, Gemini, OpenAI, etc.)
│   ├── memory.py          # Checkpointing, task context, incident memory
│   └── safety.py          # Risk assessment engine
├── executor/
│   └── middleware.py      # Generic tools + rate limiting + retries + audit
├── skills/
│   ├── base.py            # SkillRegistry (loads + caches sub-agents)
│   ├── loader.py          # YAML discovery (builtin + extra dirs, cross-platform)
│   └── builtin/           # 6 YAML skill definitions
├── inputs/
│   ├── api.py             # FastAPI REST gateway with /metrics
│   ├── slack.py           # Slack bot (Socket Mode)
│   └── k8s_events.py      # Warning event auto-investigation with backoff
├── operator/
│   └── handlers.py        # Kopf CRD handlers with status conditions
└── outputs/
    └── audit.py           # Structured audit logging
```

```
tests/
├── test_api.py                  # REST gateway tests
├── test_config.py               # Configuration tests (all providers)
├── test_executor.py             # Executor middleware tests
├── test_llm.py                  # LLM factory tests
├── test_memory.py               # Memory/checkpointer tests
├── test_middleware.py           # Rate limiter, safety, edge cases
├── test_operator.py             # Kopf handler tests
├── test_safety.py               # Safety guardrail tests
├── test_skills.py               # Skill loader tests
├── test_supervisor.py           # Supervisor agent tests (mocked LLM)
└── test_integration_smoke.py    # Live Ollama + Gemini + K8s tests
```
| Decision | Choice | Rationale |
|---|---|---|
| Agent framework | LangChain v1+ / LangGraph | Production-grade multi-agent with checkpointing |
| Operator framework | Kopf | Python-native, decorator-based, lightweight |
| Local LLM | Ollama | Easy setup, multi-model, tool-calling support |
| Cloud LLM | Gemini | Good tool-calling, fast, cost-effective |
| Safety | Pre-flight checks | Every command assessed before execution |
| Observability | structlog + Prometheus | Structured JSON logs + metrics endpoint |
| Testing | pytest + pytest-asyncio | Async-first, 75+ tests, 67% coverage |
MIT