syn-zhu/agentic-platform

Agentic Platform

A production-grade, multi-tenant Kubernetes platform for deploying, managing, and securing AI agents at scale. Built on AWS EKS with Istio Ambient Mesh, it provides enterprise-level isolation, observability, and governance for agentic workloads.

Architecture Overview

The platform combines two open-source projects — AgentGateway (data plane) and kagent (control plane) — with Istio Ambient Mesh, Keycloak, and a suite of Kyverno policies to form a complete multi-tenant agent runtime.

[Figure: Architecture diagram]

Key Design Decisions

  • Istio Ambient Mesh — Zero-sidecar service mesh. L4 mTLS via ztunnel on every node; L7 policy via per-tenant waypoint proxies. No sidecar injection overhead.
  • AgentGateway as Waypoint — The waypoint proxy implementation is AgentGateway (not Envoy), giving waypoints native understanding of MCP, A2A, and LLM protocols.
  • Credential Injection at Gateway — Tenants never hold real LLM API keys. Dummy secrets pass LiteLLM validation; the gateway strips and injects real credentials at the proxy layer.
  • Kyverno-Driven Conventions — Policies auto-generate HTTPRoutes from annotations, inject node scheduling, and enforce trace propagation. Tenants opt in with platform.agentic.io/expose: "true".
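The credential injection decision above can be illustrated with a minimal sketch of the header rewrite a gateway might perform. This is not AgentGateway's implementation; the function name and the `provider` values are hypothetical, and the only facts assumed are that Anthropic-style APIs read the key from `x-api-key` and OpenAI-style APIs read it from an `Authorization: Bearer` header.

```python
from typing import Mapping

# Headers that may carry a client-supplied credential. "authorization" is the
# OpenAI-style bearer header and "x-api-key" is the Anthropic-style header.
CREDENTIAL_HEADERS = {"authorization", "x-api-key"}


def inject_credentials(headers: Mapping[str, str], provider: str, real_key: str) -> dict:
    """Return a copy of `headers` with tenant credentials stripped and the real key added.

    `provider` is a hypothetical selector used only for this illustration.
    """
    # Drop whatever the tenant sent; header names are compared case-insensitively.
    cleaned = {k: v for k, v in headers.items() if k.lower() not in CREDENTIAL_HEADERS}
    if provider == "anthropic":
        cleaned["x-api-key"] = real_key
    elif provider == "openai":
        cleaned["Authorization"] = f"Bearer {real_key}"
    else:
        raise ValueError(f"unknown provider: {provider}")
    return cleaned
```

Because the dummy value is removed before the upstream call, a leaked tenant secret gives no access to the LLM provider in this model.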

Components

| Component | Role | Namespace |
|---|---|---|
| AgentGateway | L7 proxy — routes MCP, A2A, and LLM traffic; enforces auth policies; injects credentials; traces to Langfuse | agentgateway-system |
| kagent | K8s-native agent controller — reconciles Agent, ModelConfig, MCPServer CRDs into running workloads | kagent-system |
| Istio (Ambient) | mTLS, SPIFFE identity, ztunnel (L4), waypoint enrollment | istio-system |
| Keycloak | Identity provider — JWT issuance, tenant claims | keycloak |
| Langfuse | LLM observability — OTEL trace ingestion, prompt/completion logging, cost tracking | langfuse |
| ClickHouse | Analytics database backing Langfuse | langfuse |
| Prometheus + Grafana | Metrics collection, dashboards, alerting | monitoring |
| Kyverno | Policy engine — 5 mutation/generation policies for scheduling, routing, and protocol detection | kyverno |
| OTEL Collector | Bridges agent gRPC OTLP to Langfuse HTTP OTLP endpoint | langfuse |
| EverMemOS | Long-term memory system — REST API for memory storage, retrieval, and hybrid search (BM25 + vector + reranker). Backed by MongoDB, Elasticsearch, Milvus, and Redis | evermemos |
| Cluster Autoscaler | Automatically scales EKS node groups when pods are unschedulable due to resource pressure | kube-system |

Node Groups

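The EKS cluster uses three node groups to isolate workload types. The agents and gateway groups are tainted; the platform group is untainted and serves as the default landing zone: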

| Node Group | Instance | Taint | Workloads |
|---|---|---|---|
| platform | t3.large (2-6) | None (untainted, default landing zone) | Controllers, Langfuse, Prometheus, Keycloak, Kyverno, EverMemOS |
| agents | t3.large (1-5) | workload=agents:NoSchedule | Tenant agent pods, MCP servers, waypoint proxies |
| gateway | t3.medium (1-2) | workload=gateway:NoSchedule | AgentGateway ingress proxy (NLB-backed) |

Kyverno policies automatically inject the correct nodeSelector and tolerations — platform operators and tenants don't need to specify them. The Cluster Autoscaler monitors all three ASGs and adds/removes nodes based on pending pod scheduling pressure (scale-down threshold: 50% utilization, 10-minute cooldown).
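As a rough illustration of what the scheduling injection amounts to, the sketch below applies the equivalent mutation to a Pod manifest held as a Python dict. The toleration matches the `workload=agents:NoSchedule` taint from the table; the node label used in `nodeSelector` and the function name are assumptions, and the real policies are Kyverno resources rather than Python.

```python
import copy

# The taint comes from the node group table; the node label key/value used in
# nodeSelector is an assumption for illustration.
AGENTS_NODE_SELECTOR = {"workload": "agents"}
AGENTS_TOLERATION = {
    "key": "workload",
    "operator": "Equal",
    "value": "agents",
    "effect": "NoSchedule",
}


def inject_agent_scheduling(pod: dict) -> dict:
    """Return a copy of a Pod manifest with the agents nodeSelector and toleration added."""
    mutated = copy.deepcopy(pod)  # leave the caller's manifest untouched
    spec = mutated.setdefault("spec", {})
    spec.setdefault("nodeSelector", {}).update(AGENTS_NODE_SELECTOR)
    tolerations = spec.setdefault("tolerations", [])
    if AGENTS_TOLERATION not in tolerations:  # keep the mutation idempotent
        tolerations.append(dict(AGENTS_TOLERATION))
    return mutated
```

Both halves are needed: the toleration lets the pod land on tainted nodes, while the selector keeps it off the untainted platform group.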

Multi-Tenancy

Tenant isolation is enforced in depth across five layers:

  1. Namespace — Each tenant gets a dedicated namespace with resource quotas and limit ranges
  2. RBAC — tenant-agent-developer ClusterRole scoped per-namespace via RoleBinding
  3. NetworkPolicy — Deny-by-default at the CNI layer; explicit allow for DNS, OTEL, gateway, and external HTTPS
  4. Istio AuthorizationPolicy — L4 isolation via SPIFFE identity through ztunnel (independent of NetworkPolicy)
  5. Per-Tenant Waypoint — Each namespace gets its own AgentGateway waypoint for L7 policy (prompt guards, credential injection)
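The NetworkPolicy layer in the list above can be sketched as two standard `networking.k8s.io/v1` objects: a default deny and an explicit DNS allow. The manifests are built here as Python dicts for illustration; the policy names and the function are hypothetical, and the platform's actual policies (which also allow OTEL, gateway, and external HTTPS traffic) may be structured differently.

```python
def default_deny_with_dns(namespace: str) -> list:
    """Build a deny-by-default NetworkPolicy plus a DNS egress allowance for one namespace."""
    deny_all = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod in the namespace
            # Both directions are listed but no rules are given, so all traffic is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }
    allow_dns = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            # A rule with ports but no "to" allows port 53 to any destination.
            "egress": [
                {"ports": [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}]}
            ],
        },
    }
    return [deny_all, allow_dns]
```

NetworkPolicies are additive, so each further allowance can be a separate small object layered on the same default deny.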

Tenants can:

  • Deploy Agents, MCPServers, and Sandboxes in their namespace
  • Bring their own LLM keys (BYOK pattern via AgentgatewayBackend)
  • Expose agents via annotation — Kyverno auto-generates HTTPRoutes
  • Call agents in other namespaces through the central gateway

Directory Structure

.
├── cluster/          # EKS cluster definition (eksctl) and IAM policies
├── platform/         # Helmfile, Helm values, Kubernetes manifests, and custom images
│   ├── helmfile.yaml       # Orchestrates ~12 Helm releases in phased order
│   ├── environments/       # Per-environment config (dev, defaults)
│   ├── images/             # Custom container images
│   ├── manifests/          # Post-install K8s manifests (policies, routes, RBAC)
│   └── values/             # Helm chart value overrides
├── scripts/          # Numbered deployment pipeline (00-05) and utilities
├── tenants/          # Tenant templates, examples, and onboarded tenant configs
│   ├── _template/          # Boilerplate YAML for tenant onboarding
│   ├── examples/           # Sample agent, MCP server, sandbox, and backend configs
│   └── onboarded/          # Live tenant directories (alpha, beta, test-a2a)
├── tests/            # Kyverno unit tests and Chainsaw e2e integration tests
│   ├── kyverno/            # 5 policy test suites
│   └── e2e/                # 2 Chainsaw test suites (passthrough, egress)
└── vendor/           # Patched forks of agentgateway, kagent, and evermemos

See individual directory READMEs for detailed documentation.

Getting Started

Prerequisites

  • AWS CLI configured with appropriate permissions
  • eksctl, kubectl, helm, helmfile installed
  • kyverno CLI (for running policy tests)
  • chainsaw (for e2e tests)
  • An Anthropic API key
  • An OpenRouter API key (for EverMemOS LLM)
  • A DeepInfra API key (for EverMemOS embedding/reranking)

Deployment Pipeline

The scripts in scripts/ are numbered in execution order:

# 1. Create the EKS cluster (15-20 min)
./scripts/00-create-cluster.sh

# 2. Provision AWS resources (RDS PostgreSQL, ElastiCache Redis, S3)
./scripts/01-create-aws-resources.sh

# 3. Create Kubernetes secrets from AWS outputs
./scripts/02-create-secrets.sh

# 4. Deploy all platform services via Helmfile
./scripts/03-deploy-platform.sh

# 5. Apply post-install manifests (Kyverno policies, Grafana MCP, Agent Sandbox CRDs)
./scripts/04-post-install.sh

# 6. Onboard a tenant
./scripts/05-onboard-tenant.sh alpha
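
For unattended runs, the same sequence could be driven by a small wrapper. The sketch below is hypothetical and not part of the repository: it assumes the numbered scripts follow the `NN-name.sh` pattern shown above, that only the tenant onboarding step takes an argument, and that each script exits non-zero on failure.

```python
import re
import subprocess
from pathlib import Path

# Matches the numbered pipeline scripts, e.g. "03-deploy-platform.sh".
STEP_PATTERN = re.compile(r"^(\d{2})-.+\.sh$")


def ordered_steps(names):
    """Return the numbered pipeline scripts sorted by their two-digit prefix."""
    numbered = [n for n in names if STEP_PATTERN.match(n)]
    return sorted(numbered, key=lambda n: int(STEP_PATTERN.match(n).group(1)))


def run_pipeline(scripts_dir="scripts", tenant="alpha"):
    """Run each numbered script in order, stopping at the first failure."""
    directory = Path(scripts_dir)
    for name in ordered_steps(p.name for p in directory.iterdir()):
        cmd = [str(directory / name)]
        if name.startswith("05-"):  # the onboarding step takes the tenant name
            cmd.append(tenant)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

Utility scripts without a numeric prefix are ignored by `ordered_steps`, so only the pipeline steps run.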

Accessing Platform UIs

./scripts/port-forward.sh

| Service | URL |
|---|---|
| kagent UI | http://localhost:15000 |
| Langfuse | http://localhost:15001 |
| Grafana | http://localhost:15002 |
| AgentGateway | http://localhost:15003 |
| Keycloak | http://localhost:15004 |
| Kiali | http://localhost:15005 |

Quick Reapply

After editing manifests, reapply without a full redeploy:

./scripts/apply.sh              # Full reapply (Helmfile + manifests)
./scripts/apply.sh --skip-helm  # Manifests only

Routing

All agent, LLM, and tool traffic flows through the AgentGateway proxy:

| Path Pattern | Target | Protocol |
|---|---|---|
| /a2a/{namespace}/{agent-name} | Agent pods | A2A (Google Agent-to-Agent) |
| /llm/{namespace}/{model} | LLM backends (Anthropic, OpenAI, etc.) | HTTP (with credential injection) |
| /mcp/{namespace}/{server} | MCP tool servers | MCP (Model Context Protocol) |
| /sandbox/* | Dynamic sandbox router | HTTP (header-based routing) |

Routes are auto-generated by Kyverno when resources are annotated with platform.agentic.io/expose: "true". The optional platform.agentic.io/expose-alias annotation overrides the namespace segment in the path.
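
The path convention can be summarized with a short sketch. It only models how the two annotations and the path patterns in the table combine; the function name and the `kind` keys are hypothetical, and the real routes are HTTPRoute objects generated by Kyverno rather than strings computed in Python.

```python
# Path prefixes taken from the routing table; the dict keys are illustrative.
ROUTE_PREFIXES = {"a2a": "/a2a", "llm": "/llm", "mcp": "/mcp"}
EXPOSE = "platform.agentic.io/expose"
EXPOSE_ALIAS = "platform.agentic.io/expose-alias"


def route_path(kind, namespace, name, annotations):
    """Return the gateway path for a resource, or None if it has not opted in."""
    if annotations.get(EXPOSE) != "true":
        return None  # no opt-in annotation, so no route is generated
    # The alias, when present, replaces the namespace segment of the path.
    segment = annotations.get(EXPOSE_ALIAS) or namespace
    return f"{ROUTE_PREFIXES[kind]}/{segment}/{name}"
```

For example, a hypothetical agent `helper` in namespace `alpha` would map to `/a2a/alpha/helper`, or to `/a2a/team-a/helper` with the alias set to `team-a`.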

Observability

Agent interactions are traced end-to-end (shown in the architecture diagram above). Langfuse captures prompt/completion pairs, token usage, latency, and cost. Traces follow W3C traceparent headers across agent-to-agent calls (kagent instruments both httpx and aiohttp via OpenTelemetry for context propagation).
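
For readers unfamiliar with the header, the sketch below parses a W3C `traceparent` value and derives the header a downstream call would carry: same trace ID, new span ID. It is a simplified illustration of the version `00` layout, not kagent's code; in practice the OpenTelemetry instrumentation handles this, and the function names here are hypothetical.

```python
import re
import secrets

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into its fields, or return None if it is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Build the header for an outgoing call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        raise ValueError("invalid traceparent header")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{parsed['trace_id']}-{new_span_id}-{parsed['flags']}"
```

Because every hop keeps the trace ID, spans from separate agents can be grouped into a single trace.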

Prometheus scrapes metrics from all namespaces. Grafana dashboards are auto-discovered, and a Grafana MCP tool server is available to kagent's built-in observability agent.

Testing

# Run all tests
./tests/run-all.sh

# Unit tests only (Kyverno policy validation, no cluster required)
./tests/run-all.sh --unit-only

# Integration tests only (requires running cluster)
./tests/run-all.sh --integration-only

Kyverno Policy Tests (Unit)

  • platform-scheduling — Injects platform node affinity
  • tenant-scheduling — Injects agent node affinity
  • auto-expose — Generates HTTPRoutes from annotations

Chainsaw E2E Tests (Integration)

  • waypoint-passthrough — Verifies intra-namespace traffic routes through waypoint
  • waypoint-egress — Tests external LLM API calls with TLS origination and credential injection

Vendored Dependencies

The vendor/ directory contains patched forks of three projects:

  • agentgateway (Rust) — Patched for Istio Ambient waypoint integration and HBONE protocol support. Branch: waypoint.
  • kagent (Go/Python/TypeScript) — Patched for trace context propagation (traceparent/tracestate). Natively supports appProtocol: kgateway.dev/a2a on agent Services and OTEL instrumentation of aiohttp/httpx.
  • evermemos (Python) — Patched to remove hard dependency on .env file, allowing configuration entirely via Kubernetes ConfigMap and Secret. Image built from vendor source and pushed to ECR.

All are upstream-tracking forks. Changes are scoped to features not yet merged upstream.

External Dependencies (AWS)

| Service | Purpose | Configuration |
|---|---|---|
| RDS PostgreSQL 17 | Keycloak + Langfuse database | db.t4g.small, encrypted, 20GB gp3 |
| ElastiCache Redis | Langfuse session/cache | cache.t4g.small, transit encryption |
| S3 | Langfuse event/media uploads | agentic-platform-langfuse-{env}, AES256, no public access |
| EBS (gp3) | Persistent volumes for ClickHouse, Prometheus, Grafana, EverMemOS (MongoDB, Elasticsearch, Milvus, etcd, MinIO) | Provisioned via EBS CSI driver |
