Find wasted Kubernetes spend before it lands on the bill.
Kopilot is an approval-gated AI operator for Kubernetes teams. It investigates over-provisioned workloads, idle resources, and orphaned storage from one prompt, explains the evidence in plain English, and recommends the next safe action.
Public CLI note: install the Python package as `kubedevaiops`, then run the branded `kopilot` command. Internal package paths and Kubernetes API groups still use `kubedevaiops` during the transition.
Built for
- Platform and DevOps teams managing noisy multi-namespace clusters
- Operators who want self-hosted automation with audit logs
- Teams that want recommendations first and approval before risky cleanup
```bash
pip install kubedevaiops
kopilot ask "Find over-provisioned deployments and orphaned PVCs across all namespaces"
```

What the first run should give you:
- Evidence from cluster metrics and live object inspection
- Plain-English explanation of where waste is coming from
- Rightsizing and cleanup recommendations, with destructive follow-up left behind approval gates
- Read-first by default: investigations start with `get`, `describe`, `logs`, and metrics collection before any mutation is considered.
- Approval-gated changes: destructive actions such as deletes, drains, and cordons require explicit approval.
- Protected namespaces: high-risk namespaces are blocked from destructive actions.
- Audit trail: every delegated task and executed command is recorded.
- Self-hosted posture: the core deployment model assumes you control the cluster access path.
| Use case | Prompt | Expected outcome |
|---|---|---|
| Rightsizing workloads | Find over-provisioned deployments across all namespaces | Highlights requests vs observed usage and suggests safer request targets |
| Idle resource review | Show me resources that look idle and worth review | Flags low-usage workloads without auto-deleting them |
| Orphaned storage | List PVCs that are not attached to running workloads | Surfaces old storage objects with namespace and age context |
| Safe follow-up plan | Recommend the next cost-saving actions for staging | Produces an ordered plan that still keeps destructive cleanup approval-gated |
Kopilot now exposes one concrete interop surface and two discovery surfaces:

- `kopilot mcp` runs an MCP server for external agent clients
- `GET /skills/portable` exposes enabled skills as portable manifests
- `GET /.well-known/agent-manifest.json` exposes an async-first discovery document

```bash
kopilot mcp --transport stdio
```

Current skill model:
- Built-in YAML skills ship with the package
- Mounted YAML skills can be loaded through `SKILL_DIRS`
- AISkill CRDs are the planned next source and will map to the same portable manifest shape
Agent-to-agent note:
- MCP is the implemented tool-level interoperability layer today
- ACP has been folded into the broader agent-to-agent layer, so Kopilot keeps an async task API and discovery manifest for now instead of claiming full protocol support prematurely
- No hardcoded tools — The LLM constructs any command it needs. The executor middleware provides `run_kubectl`, `run_helm`, `run_shell`, and `read_resource` as generic capabilities.
- Skills = Sub-Agents — Each skill is defined in a YAML file with a domain-specific system prompt and documentation. At runtime, it becomes a fully autonomous ReAct agent that reasons, plans, and executes.
- Dynamic skill loading — Skills are loaded from:
  - Built-in YAML files shipped with the package
  - Extra directories (e.g. ConfigMap volumes in K8s)
  - AISkill CRDs applied to the cluster
- Supervisor pattern — A coordinator agent routes user requests to the best sub-agent(s), allowing multi-domain tasks via parallel delegation.
- Safety-first execution — Every command goes through a safety layer that blocks operations on protected namespaces, requires approval for destructive actions, and audits everything.
- Self-improving agents — Optional reflection loop evaluates response quality and re-plans if the score is below threshold. Feedback is logged for continuous improvement.
- Multi-provider LLM — Supports Ollama (local), Gemini, OpenAI, Azure OpenAI, and Anthropic. Switch providers via a single env var.
- Production observability — Prometheus `/metrics` endpoint, structured audit logging, task history, rate limiting, and execution statistics.
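The reflection loop can be pictured as a simple evaluate-and-replan cycle. The sketch below is illustrative only — `ToyAgent`, `answer_with_reflection`, and the 0.7 threshold are assumptions for demonstration, not Kopilot's actual API:

```python
class ToyAgent:
    """Stub standing in for a Kopilot sub-agent (illustrative only)."""

    def __init__(self):
        self.calls = 0

    def run(self, prompt, feedback=None):
        self.calls += 1
        return f"draft {self.calls}"

    def evaluate(self, prompt, response):
        # Pretend response quality improves each round (0..1 score).
        return 0.4 + 0.2 * self.calls


def answer_with_reflection(agent, prompt, threshold=0.7, max_rounds=3):
    """Re-plan until the self-evaluated score clears the threshold."""
    response = agent.run(prompt)
    while agent.evaluate(prompt, response) < threshold and agent.calls < max_rounds:
        # Below threshold: feed the critique back and try again.
        response = agent.run(prompt, feedback="score below threshold; re-plan")
    return response
```

With a real LLM backing `evaluate`, the feedback string would carry the critique; here the stub just shows the control flow.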
Skills are not Python code. They are YAML definitions that instruct an LLM:
```yaml
name: security
display_name: Security Operations
description: Kubernetes security auditing and hardening
system_prompt: |
  You are an expert Kubernetes security engineer.
  - Audit RBAC roles and bindings for over-privilege
  - Check pod security contexts and standards
  - Evaluate NetworkPolicy coverage
  - Run compliance checks against CIS Benchmarks
  ...
documentation: |
  # Quick Reference
  - List RBAC: kubectl get roles,rolebindings,clusterroles -A
  - Pod security: kubectl get pods -A -o jsonpath='{...securityContext}'
  ...
```

| Skill | Domain | What it can do |
|---|---|---|
| `security` | Security & Compliance | RBAC audit, pod security, network policies, CIS benchmarks |
| `administration` | Cluster Admin | Deployments, scaling, rollouts, Helm, namespaces |
| `networking` | Network Ops | Services, Ingress, DNS, connectivity, service mesh |
| `monitoring` | Observability | Resource usage, HPA, events, Prometheus, logging |
| `troubleshooting` | Incident Response | Root cause analysis, crash diagnostics, scheduling |
| `cost_optimization` | FinOps | Right-sizing, idle detection, orphaned resources |
Drop a YAML file in skills/builtin/ or mount it at runtime:
```yaml
name: my_custom_skill
display_name: My Custom Skill
description: Does something specific to my org
system_prompt: |
  You are an expert in [domain]. You know how to...
documentation: |
  Internal runbook for [domain]...
```

Set `SKILL_DIRS=/path/to/extra/skills` or mount a ConfigMap.
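Discovery across built-in and mounted directories can be sketched roughly as below. The function name `discover_skill_files` is an assumption for illustration, not Kopilot's real loader; the `os.pathsep` split is what gives the `;` on Windows / `:` on Linux behavior:

```python
import os
from pathlib import Path


def discover_skill_files(builtin_dir, env_var="SKILL_DIRS"):
    """Collect skill YAML paths: built-ins first, then extra dirs.

    SKILL_DIRS uses the platform path separator (';' on Windows,
    ':' elsewhere), i.e. os.pathsep.
    """
    search_dirs = [Path(builtin_dir)]
    extra = os.environ.get(env_var, "")
    search_dirs += [Path(d) for d in extra.split(os.pathsep) if d]

    files = []
    for d in search_dirs:
        if d.is_dir():
            # Sketch only matches *.yaml; a real loader might accept *.yml too.
            files.extend(sorted(d.glob("*.yaml")))
    return files
```

Each discovered file would then be parsed (e.g. with `yaml.safe_load`) into the `name` / `system_prompt` / `documentation` shape shown above.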
| Channel | Protocol | Description |
|---|---|---|
| REST API | HTTP/JSON | POST /tasks with natural language prompt |
| Slack | Socket Mode | @mention or DM the bot |
| Webhooks | HTTP/JSON | POST /webhook from ServiceNow, PagerDuty, etc. |
| K8s Events | Watch API | Auto-investigates repeated Warning events |
| CRD | `AITask` | Apply a CRD manifest and the operator handles it |
- Protected namespaces: `kube-system`, `kube-public`, `kube-node-lease` — all destructive operations blocked.
- Approval gates: Delete, drain, cordon operations require explicit approval before execution.
- Risk assessment: Every command is classified (LOW / MEDIUM / HIGH / CRITICAL) before execution.
- Audit trail: Structured logs of every command, delegation, and result.
- Blocked patterns: `rm -rf /`, `dd`, `mkfs` and similar are hard-blocked.
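A pre-flight check combining these guardrails might look like the sketch below. The function name `preflight` and the return values are illustrative, not Kopilot's actual safety-layer API:

```python
import re

PROTECTED_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}
DESTRUCTIVE_VERBS = {"delete", "drain", "cordon"}
HARD_BLOCKED = [r"\brm\s+-rf\s+/", r"\bdd\b", r"\bmkfs\b"]


def preflight(command, namespace):
    """Illustrative pre-flight check: 'blocked', 'needs_approval', or 'allowed'."""
    # Hard-blocked shell patterns are rejected outright.
    if any(re.search(p, command) for p in HARD_BLOCKED):
        return "blocked"

    parts = command.split()
    verb = parts[1] if len(parts) > 1 and parts[0] == "kubectl" else ""
    if verb in DESTRUCTIVE_VERBS:
        if namespace in PROTECTED_NAMESPACES:
            return "blocked"        # destructive ops never touch protected ns
        return "needs_approval"     # approval gate for destructive actions
    return "allowed"
```

A real risk engine would also assign the LOW / MEDIUM / HIGH / CRITICAL classification before execution; this sketch only shows the gating decision.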
- Python 3.11+ (tested on 3.13)
- One of: Ollama (local), Gemini API key, or OpenAI API key
- kubectl configured with cluster access
- Optional: Podman or Docker for container builds
```bash
cd kopilot
pip install -e ".[dev]"
cp .env.example .env   # Edit with your LLM provider settings
kopilot serve --port 8080
```

```bash
export LLM_PROVIDER=gemini
export GEMINI_API_KEY=your-api-key
export GEMINI_MODEL=gemini-2.5-flash
kopilot serve
```

```bash
ollama pull qwen3:8b
export LLM_PROVIDER=ollama
export LLM_MODEL=qwen3:8b
kopilot serve
```

```bash
curl -X POST http://localhost:8080/tasks \
  -H "Content-Type: application/json" \
  -d '{"prompt": "List all nodes and check their health"}'
```

```bash
curl -X POST http://localhost:8080/tasks \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Audit RBAC for over-privileged accounts", "reflect": true}'
```

```bash
kopilot ask "What pods are running in kube-system?"
```

All settings via environment variables or `.env` file:
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `ollama` | `ollama`, `gemini`, `openai`, `azure_openai`, `anthropic` |
| `LLM_MODEL` | `gpt-oss:20b` | Model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API endpoint |
| `GEMINI_API_KEY` | (empty) | Google Gemini API key |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Gemini model name |
| `SAFETY_PROTECTED_NAMESPACES` | `kube-system,...` | JSON list |
| `SAFETY_REQUIRE_APPROVAL_DESTRUCTIVE` | `true` | Gate destructive ops |
| `SAFETY_MAX_CONCURRENT_TASKS` | `5` | Concurrent task limit |
| `ENABLED_SKILLS` | all 6 built-ins | JSON list of skill names |
| `SKILL_DIRS` | (empty) | Extra dirs (`;`-separated on Windows, `:`-separated on Linux) |
| `LOG_FORMAT` | `json` | `json` or `console` |
| `METRICS_ENABLED` | `true` | Enable `/metrics` endpoint |
| `METRICS_PORT` | `9090` | Prometheus metrics port |
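Putting a few of these together, a minimal `.env` for a local Ollama setup might look like this (values are illustrative, not required defaults):

```
LLM_PROVIDER=ollama
LLM_MODEL=gpt-oss:20b
OLLAMA_BASE_URL=http://localhost:11434
SAFETY_REQUIRE_APPROVAL_DESTRUCTIVE=true
LOG_FORMAT=console
METRICS_ENABLED=true
```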
```bash
helm install kubedevaiops ./helm/kubedevaiops \
  --set llm.ollamaUrl=http://ollama:11434 \
  --set llm.model="gpt-oss:20b"
```

```bash
kubectl apply -f helm/kubedevaiops/crds/
kubectl apply -f deploy/quickstart.yaml
```

```yaml
apiVersion: kubedevaiops.io/v1alpha1
kind: AITask
metadata:
  name: security-audit
spec:
  prompt: "Audit RBAC and pod security across all namespaces"
```

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with skills count and LLM provider |
| `/readyz` | GET | Readiness probe |
| `/skills` | GET | List loaded skills |
| `/tasks` | POST | Submit a task (accepts `prompt`, `reflect`, `namespace`) |
| `/tasks/history` | GET | Recent task history (configurable `?limit=`) |
| `/webhook` | POST | Webhook handler for external integrations |
| `/metrics` | GET | Prometheus-compatible metrics |
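The `/tasks` endpoint can be driven from Python with nothing but the standard library. The helper names below are illustrative, and `base_url` assumes a local `kopilot serve`:

```python
import json
import urllib.request


def build_task_request(prompt, base_url="http://localhost:8080", reflect=False):
    """Build a POST /tasks request with a JSON body."""
    body = json.dumps({"prompt": prompt, "reflect": reflect}).encode()
    return urllib.request.Request(
        f"{base_url}/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(prompt, **kwargs):
    """Send the task and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_task_request(prompt, **kwargs)) as resp:
        return json.load(resp)
```

For example, `submit_task("Audit RBAC for over-privileged accounts", reflect=True)` mirrors the `curl` call shown in the quick-start section.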
```bash
make dev            # Install with dev dependencies (editable)
make test           # Run pytest with coverage
make lint           # Run ruff linter
make format         # Auto-format code
make run            # Start the full agent locally
make build-podman   # Build container image with Podman
```

```bash
podman machine start
bash scripts/setup-local-k8s.sh
kubectl apply -f helm/kubedevaiops/crds/
```

```bash
# Unit tests only (fast, no external deps)
pytest tests/ -v --ignore=tests/test_integration_smoke.py

# Full suite including integration tests (requires Ollama + K8s)
pytest tests/ -v

# End-to-end smoke test with specific provider
python scripts/smoke_e2e.py gemini
python scripts/smoke_e2e.py ollama
```

```
src/kubedevaiops/
├── agent/
│   ├── supervisor.py      # Main coordinator with reflection loop
│   ├── subagent.py        # Factory: builds agent from YAML skill definition
│   ├── llm.py             # Multi-provider LLM factory (Ollama, Gemini, OpenAI, etc.)
│   ├── memory.py          # Checkpointing, task context, incident memory
│   └── safety.py          # Risk assessment engine
├── executor/
│   └── middleware.py      # Generic tools + rate limiting + retries + audit
├── skills/
│   ├── base.py            # SkillRegistry (loads + caches sub-agents)
│   ├── loader.py          # YAML discovery (builtin + extra dirs, cross-platform)
│   └── builtin/           # 6 YAML skill definitions
├── inputs/
│   ├── api.py             # FastAPI REST gateway with /metrics
│   ├── slack.py           # Slack bot (Socket Mode)
│   └── k8s_events.py      # Warning event auto-investigation with backoff
├── operator/
│   └── handlers.py        # Kopf CRD handlers with status conditions
└── outputs/
    └── audit.py           # Structured audit logging
```

```
tests/
├── test_api.py                  # REST gateway tests
├── test_config.py               # Configuration tests (all providers)
├── test_executor.py             # Executor middleware tests
├── test_llm.py                  # LLM factory tests
├── test_memory.py               # Memory/checkpointer tests
├── test_middleware.py           # Rate limiter, safety, edge cases
├── test_operator.py             # Kopf handler tests
├── test_safety.py               # Safety guardrail tests
├── test_skills.py               # Skill loader tests
├── test_supervisor.py           # Supervisor agent tests (mocked LLM)
└── test_integration_smoke.py    # Live Ollama + Gemini + K8s tests
```
| Decision | Choice | Rationale |
|---|---|---|
| Agent framework | LangChain v1+ / LangGraph | Production-grade multi-agent with checkpointing |
| Operator framework | Kopf | Python-native, decorator-based, lightweight |
| Local LLM | Ollama | Easy setup, multi-model, tool-calling support |
| Cloud LLM | Gemini | Good tool-calling, fast, cost-effective |
| Safety | Pre-flight checks | Every command assessed before execution |
| Observability | structlog + Prometheus | Structured JSON logs + metrics endpoint |
| Testing | pytest + pytest-asyncio | Async-first, 75+ tests, 67% coverage |
MIT