AgentOps with Kubernetes

What This Is

A hands-on course on AgentOps: running production-grade AI agents on Kubernetes. It builds on the LLMOps foundation taught in 302-llmops. Students take the LLM serving stack (vLLM, RAG, fine-tuned model) and add agent capabilities: Hermes Agent (NousResearch), MCP tool servers, Kubernetes Agent Sandbox for isolated agent execution, OTEL/Tempo distributed tracing, cost middleware, two-layer guardrails (MCP middleware + Hermes prompt prefix), and a DeepEval gate in the training pipeline. In the capstone, students ship an insurance_check MCP tool end-to-end through TDD → GitOps → eval gate → ArgoCD → Grafana.

Prerequisite course: 302-llmops v1.0.0+. AgentOps assumes you have a running LLM serving stack on KIND. Take 302 first.

Core Value

Teach practitioners how to operate AI agents on Kubernetes: agent architecture, tool calling via MCP, sandbox isolation, distributed tracing of agent reasoning, cost attribution, eval gates in CI, and guardrails. This course is the bridge between "I can deploy a model" (LLMOps) and "I can deploy an agent that uses that model safely in production" (AgentOps).

Inheritance

This repo is the AgentOps split-out from schoolofdevops/302-llmops v0.19.0. The combined v0.19.0 release shipped LLMOps (Labs 0-6) and AgentOps (Labs 7-13); this repo holds the AgentOps half going forward.

Full history of the original combined course: https://github.com/schoolofdevops/302-llmops/tree/v0.19.0 (tag SHA 3c4e0b120efd93a147d61f916a943e6a775ec717)

See MIGRATION-FROM-302-LLMOPS.md for the full migration dossier.

What Was Validated at v0.19.0 (Phase 3: AgentOps Labs Day 2)

(Copied verbatim from schoolofdevops/302-llmops .planning/PROJECT.md §"Validated in Phase 3: AgentOps Labs Day 2" at SHA 3c4e0b120efd93a147d61f916a943e6a775ec717. Every item below was live-tested on a KIND cluster.)

Hermes Agent (NousResearch v0.12.0) configured for Smile Dental — 3 MCP tool servers (triage, treatment_lookup, book_appointment) + multi-step workflow validated live (Lab 07)
Two-phase LLM strategy — Day 2 labs switch to free-tier API; both Groq (llama-3.3-70b-versatile) and Gemini (gemini-2.5-flash) live-tested
Kubernetes Agent Sandbox v0.4.3 — CRDs installed, agent deployed as Sandbox + SandboxWarmPool (replicas=2) + NetworkPolicy + Sandbox Router gateway (Lab 08)
Cold-vs-warm timing demo — observed warm 7.95s / cold refill 25.03s / cold 2.54s
Agent observability — Grafana Tempo + OTEL Collector deployed; 3 MCP tools auto-instrumented; cost middleware emits agent_llm_tokens_total + agent_llm_cost_usd_total; Grafana dashboard auto-discovered (Lab 09)
D-18 partial compliance documented honestly: tool/retriever spans hierarchical; Hermes-internal agent.request/llm.completion not visible (closed binary)
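
As an illustration of the cost middleware item above, here is a minimal sketch of how token and cost counters with the names agent_llm_tokens_total and agent_llm_cost_usd_total could be accumulated and rendered in Prometheus text format. It is not the Lab 09 implementation: the CostMeter class, the "model" label, and the per-token prices are all hypothetical, and the prices are notional because the labs use free-tier APIs.

```python
from dataclasses import dataclass, field

# Hypothetical per-1K-token prices in USD (assumption, not from the course).
# Free-tier usage costs nothing, so these are notional attribution rates.
PRICE_PER_1K_TOKENS_USD = {
    "llama-3.3-70b-versatile": 0.0006,
    "gemini-2.5-flash": 0.0003,
}


@dataclass
class CostMeter:
    """Accumulates token and cost counters keyed by model name."""

    tokens: dict = field(default_factory=dict)
    cost_usd: dict = field(default_factory=dict)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one LLM call's usage to the running totals."""
        total = prompt_tokens + completion_tokens
        rate = PRICE_PER_1K_TOKENS_USD.get(model, 0.0)  # unknown models cost 0
        self.tokens[model] = self.tokens.get(model, 0) + total
        self.cost_usd[model] = self.cost_usd.get(model, 0.0) + total / 1000 * rate

    def render(self) -> str:
        """Render both counters in Prometheus text exposition format."""
        lines = []
        for model in sorted(self.tokens):
            # The "model" label is an assumed label name.
            lines.append(f'agent_llm_tokens_total{{model="{model}"}} {self.tokens[model]}')
            lines.append(
                f'agent_llm_cost_usd_total{{model="{model}"}} {self.cost_usd[model]:.6f}'
            )
        return "\n".join(lines)


meter = CostMeter()
meter.record("gemini-2.5-flash", prompt_tokens=800, completion_tokens=200)
print(meter.render())
```

In a real deployment the counters would more likely come from a metrics client library and be scraped by Prometheus; the plain-dict version here only shows the bookkeeping.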

Inherited Key Decisions

(Verbatim from schoolofdevops/302-llmops .planning/PROJECT.md Key Decisions table — AgentOps-relevant rows only. LOCKED unless explicitly revisited.)

Decision	Rationale	Status
Kubernetes Agent Sandbox for agentic module	First-class K8s primitive for agent workloads — new, differentiated, production-relevant	Inherited
Agent framework: Hermes Agent	NousResearch/hermes-agent — model-agnostic, lightweight ($5 VPS), 40+ tools, MCP support, Docker sandbox built-in, MIT licensed, 47k stars. Configure and deploy, don't build from scratch.	Inherited
No LangGraph/CrewAI	Over-abstracted Pythonic frameworks are dated. Hermes is the modern approach — self-improving, persistent memory, multi-platform.	Inherited
Two-phase LLM strategy	Labs 00-05 use local SmolLM2-135M (LLMOps focus). Labs 06+ switch to free-tier API (Gemini/Groq) for agentic capabilities — local 135M model can't do tool-calling reliably.	Inherited
Support both Gemini and Groq	Abstract behind OpenAI-compatible API so students can use either free-tier provider	Inherited
Move AgentOps to schoolofdevops/303-agentops	Companion course builds on LLMOps foundation; separate repo isolates dependencies and sequencing.	This repo
Drop eval gate from Argo Workflows pipeline (in 302)	Eval gating is contextually agentic. LLMOps pipeline teaches orchestration; eval lives here.	This repo carries the eval gate
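
The "Support both Gemini and Groq" decision can be pictured as a small lookup that maps a provider name to OpenAI-compatible connection settings. The sketch below is an assumption about how such an abstraction might look, not the course's code: the resolve_llm_config helper and the environment variable names are hypothetical, and the base URLs are the providers' OpenAI-compatible endpoints as commonly documented, so verify them against current provider docs before relying on them. The model names are the ones live-tested at v0.19.0.

```python
import os

# Provider table. base_url values should be checked against provider docs;
# api_key_env names are hypothetical.
PROVIDERS = {
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "model": "llama-3.3-70b-versatile",
        "api_key_env": "GROQ_API_KEY",
    },
    "gemini": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "model": "gemini-2.5-flash",
        "api_key_env": "GEMINI_API_KEY",
    },
}


def resolve_llm_config(provider: str, env=None) -> dict:
    """Return base_url, model, and api_key for the chosen provider.

    `env` defaults to os.environ; pass a dict to make the lookup testable.
    """
    env = os.environ if env is None else env
    try:
        spec = PROVIDERS[provider.lower()]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"unknown provider {provider!r}; expected one of: {known}") from None
    return {
        "base_url": spec["base_url"],
        "model": spec["model"],
        "api_key": env.get(spec["api_key_env"], ""),
    }


config = resolve_llm_config("groq", env={"GROQ_API_KEY": "example-key"})
print(config["base_url"], config["model"])
```

Because both providers sit behind the same request shape, switching between them only changes these three values; the agent and its tools stay untouched.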

Known Issues Inherited

D-18 partial compliance: Hermes is a closed binary (NousResearch/hermes-agent v0.12.0). Tool spans and retriever spans are visible in Grafana Tempo (auto-instrumented via OTEL). Hermes-internal agent.request and llm.completion spans are NOT visible because instrumentation hooks are not exposed. Documented in Lab 09. Workaround paths (custom Hermes build, alternative agent runtime) deferred.

Constraints

Hardware: Same 16GB-RAM CPU-only KIND budget as 302-llmops.
Platform: Same macOS + Windows + Linux requirement.
Free-tier LLM API: Either Groq or Gemini free tier. Students must not be required to pay.
Naming: Smile Dental (inherited from 302-llmops, globally accessible branding).

Active

(v0.1.0 — to be defined via REQUIREMENTS.md when this repo's first milestone is planned. The v0.19.0 baseline above is the inherited foundation; new milestones build on top.)

Out of Scope (inherited from 302-llmops split decision)

LLMOps content (data pipelines, RAG, LoRA fine-tuning, OCI packaging, plain vLLM serving, Prometheus/Grafana for vLLM, autoscaling for vLLM, GitOps for vLLM, Argo Workflows training pipeline without eval gate). Lives in 302-llmops.

Phase 3 validation date: pre-2026-05-07 (combined v0.19.0 release).
This repo was bootstrapped from 302-llmops v0.19.0 on 2026-05-07. See MIGRATION-FROM-302-LLMOPS.md.
