A hands-on course on AgentOps — running production-grade AI agents on Kubernetes. Builds on the LLMOps foundation taught in 302-llmops. Students take the LLM serving stack (vLLM, RAG, fine-tuned model) and add agent capabilities: Hermes Agent (NousResearch), MCP tool servers, Kubernetes Agent Sandbox for isolated agent execution, OTEL/Tempo distributed tracing, cost middleware, two-layer guardrails (MCP middleware + Hermes prompt prefix), DeepEval gate in the training pipeline, and a capstone where students ship an insurance_check MCP tool end-to-end through TDD → GitOps → eval gate → ArgoCD → Grafana.
Prerequisite course: 302-llmops v1.0.0+. AgentOps assumes you have a running LLM serving stack on KIND. Take 302 first.
Teach practitioners how to operate AI agents on Kubernetes — agent architecture, tool calling via MCP, sandbox isolation, distributed tracing of agent reasoning, cost attribution, eval gates in CI, and guardrails. The bridge between "I can deploy a model" (LLMOps) and "I can deploy an agent that uses that model safely in production" (AgentOps).
This repo is the AgentOps split-out from schoolofdevops/302-llmops v0.19.0. The combined v0.19.0 release shipped LLMOps (Labs 0-6) and AgentOps (Labs 7-13); this repo holds the AgentOps half going forward.
Full history of the original combined course: https://github.com/schoolofdevops/302-llmops/tree/v0.19.0 (tag SHA 3c4e0b120efd93a147d61f916a943e6a775ec717)
See MIGRATION-FROM-302-LLMOPS.md for the full migration dossier.
(Copied verbatim from schoolofdevops/302-llmops .planning/PROJECT.md §"Validated in Phase 3: AgentOps Labs Day 2" at SHA 3c4e0b120efd93a147d61f916a943e6a775ec717. Every item below was live-tested on a KIND cluster.)
- Hermes Agent (NousResearch v0.12.0) configured for Smile Dental — 3 MCP tool servers (triage, treatment_lookup, book_appointment) + multi-step workflow validated live (Lab 07)
- Two-phase LLM strategy — Day 2 labs switch to free-tier API; both Groq (
llama-3.3-70b-versatile) and Gemini (gemini-2.5-flash) live-tested - Kubernetes Agent Sandbox v0.4.3 — CRDs installed, agent deployed as Sandbox + SandboxWarmPool (replicas=2) + NetworkPolicy + Sandbox Router gateway (Lab 08)
- Cold-vs-warm timing demo — observed warm 7.95s / cold refill 25.03s / cold 2.54s
- Agent observability — Grafana Tempo + OTEL Collector deployed; 3 MCP tools auto-instrumented; cost middleware emits
agent_llm_tokens_total+agent_llm_cost_usd_total; Grafana dashboard auto-discovered (Lab 09) - D-18 partial compliance documented honestly: tool/retriever spans hierarchical; Hermes-internal
agent.request/llm.completionnot visible (closed binary)
(Verbatim from schoolofdevops/302-llmops .planning/PROJECT.md Key Decisions table — AgentOps-relevant rows only. LOCKED unless explicitly revisited.)
| Decision | Rationale | Status |
|---|---|---|
| Kubernetes Agent Sandbox for agentic module | First-class K8s primitive for agent workloads — new, differentiated, production-relevant | Inherited |
| Agent framework: Hermes Agent | NousResearch/hermes-agent — model-agnostic, lightweight ($5 VPS), 40+ tools, MCP support, Docker sandbox built-in, MIT licensed, 47k stars. Configure and deploy, don't build from scratch. | Inherited |
| No LangGraph/CrewAI | Over-abstracted Pythonic frameworks are dated. Hermes is the modern approach — self-improving, persistent memory, multi-platform. | Inherited |
| Two-phase LLM strategy | Labs 00-05 use local SmolLM2-135M (LLMOps focus). Labs 06+ switch to free-tier API (Gemini/Groq) for agentic capabilities — local 135M model can't do tool-calling reliably. | Inherited |
| Support both Gemini and Groq | Abstract behind OpenAI-compatible API so students can use either free-tier provider | Inherited |
| Move AgentOps to schoolofdevops/303-agentops | Companion course builds on LLMOps foundation; separate repo isolates dependencies and sequencing. | This repo |
| Drop eval gate from Argo Workflows pipeline (in 302) | Eval gating is contextually agentic. LLMOps pipeline teaches orchestration; eval lives here. | This repo carries the eval gate |
- D-18 partial compliance: Hermes is a closed binary (NousResearch/hermes-agent v0.12.0). Tool spans and retriever spans are visible in Grafana Tempo (auto-instrumented via OTEL). Hermes-internal
agent.requestandllm.completionspans are NOT visible because instrumentation hooks are not exposed. Documented in Lab 09. Workaround paths (custom Hermes build, alternative agent runtime) deferred.
- Hardware: Same 16GB-RAM CPU-only KIND budget as 302-llmops.
- Platform: Same macOS + Windows + Linux requirement.
- Free-tier LLM API: Either Groq or Gemini free tier. Students must not be required to pay.
- Naming: Smile Dental (inherited from 302-llmops, globally accessible branding).
(v0.1.0 — to be defined via REQUIREMENTS.md when this repo's first milestone is planned. The v0.19.0 baseline above is the inherited foundation; new milestones build on top.)
- LLMOps content (data pipelines, RAG, LoRA fine-tuning, OCI packaging, plain vLLM serving, Prometheus/Grafana for vLLM, autoscaling for vLLM, GitOps for vLLM, Argo Workflows training pipeline without eval gate). Lives in 302-llmops.
Phase 3 validation date: pre-2026-05-07 (combined v0.19.0 release) This repo bootstrapped from 302-llmops v0.19.0 on 2026-05-07 — see MIGRATION-FROM-302-LLMOPS.md