PoC: Kagenti integration for agent lifecycle management #2304

@jezekra1

Summary

Refactor agentstack so that agent scaling, deployment, and discovery are handled by kagenti instead of our custom Kubernetes provider management. This gives us a standard A2A agent lifecycle, real-time agent card discovery, and a path to zero-trust identity (SPIRE/SPIFFE), service mesh (Istio Ambient), and Shipwright-based builds.

What We Gain

  • Standard A2A agent lifecycle management (deploy, discover, scale)
  • Real-time agent card discovery (no more storing cards in the DB or in Docker labels)
  • Zero-trust identity via SPIRE/SPIFFE (optional)
  • Service mesh observability via Istio Ambient (optional)
  • Shipwright-based builds replacing our Kaniko pipeline (optional)
  • Team namespace isolation

What We Drop

  • KubernetesProviderDeploymentManager (custom kr8s-based deployment logic)
  • KubernetesProviderBuildManager (Kaniko build jobs with Docker label baking)
  • Agent card storage in database / Docker image labels
  • Scale-to-zero logic (kagenti handles agent lifecycle)
  • The concept of "managed" vs "unmanaged" providers (all agents become kagenti-managed)

Target Architecture

Lima VM (MicroShift)
├── agentstack namespace
│   ├── agentstack-server (FastAPI, slimmed down)
│   │   ├── A2A Proxy → delegates to kagenti agent services
│   │   └── Provider Registry → reads from kagenti API / agent cards
│   ├── PostgreSQL
│   ├── Redis
│   └── SeaweedFS
├── keycloak namespace (shared)
│   └── Keycloak (StatefulSet)
├── kagenti-system namespace
│   ├── kagenti-operator
│   ├── kagenti-webhook
│   ├── kagenti-ui (backend + frontend) [optional]
│   └── MCP Gateway
├── team1, team2, ... (agent namespaces)
│   └── agent Deployments + Services
├── istio-system (optional)
├── zero-trust-workload-identity-manager (optional)
└── cr-system (container registry, optional)

Key Decisions

Installing Kagenti into MicroShift: Helm-only

Strip kagenti down to its two Helm charts (kagenti and kagenti-deps) and install them directly into MicroShift. Skip the Ansible playbook entirely.

What Ansible does that we replicate elsewhere:

  1. Cluster creation → already handled by Lima/MicroShift
  2. DNS setup → already handled by Lima networking
  3. OAuth secret creation → Helm hooks or init containers
  4. Image preloading → pre-pull in Lima config
  5. Shipwright ClusterBuildStrategy → Helm template or post-install hook

Configurable Feature Toggles

kagenti:
  enabled: true
  features:
    istio:
      enabled: false        # Service mesh (Ambient mode)
    spire:
      enabled: false        # Zero-trust workload identity
    shipwright:
      enabled: false        # Container builds
    builds:
      enabled: false        # Build UI + Tekton
    observability:
      phoenix:
        enabled: true       # LLM trace viewer
      otel:
        enabled: false      # OpenTelemetry collector
      kiali:
        enabled: false      # Service mesh dashboard
    containerRegistry:
      enabled: false        # In-cluster registry
    certManager:
      enabled: false        # Certificate management
    mcpGateway:
      enabled: false        # MCP Gateway

Minimal local setup: just kagenti operator + webhook + Keycloak. Full-featured: everything enabled.

Keycloak: kagenti-deps owns it, agentstack consumes it

  • kagenti-deps deploys Keycloak in keycloak namespace
  • Agentstack Helm chart disables its own Keycloak and references the external instance
  • Deployment order: kagenti-deps → agentstack → kagenti

Keycloak Realms: Separate

  • Agentstack keeps its agentstack realm
  • Kagenti keeps its kagenti realm
  • Both realms in the same Keycloak instance, fully independent
  • Agentstack's provision-job.yaml stays in agentstack Helm chart, targets keycloak-service.keycloak:8080
  • Admin credentials shared via K8s secret or dedicated admin client

Agent Discovery: Kagenti Backend API

Agentstack calls kagenti's REST API to discover agents:

agentstack-server → HTTP → kagenti-backend → K8s API → Deployments with kagenti.io/type=agent

Agentstack-server gets a dedicated Keycloak client in the kagenti realm (e.g., agentstack-api) with kagenti-viewer role, using client credentials grant for service-to-service auth.

Direct K8s label scanning was rejected because K8s RBAC is namespace-scoped: agentstack's service account cannot list Deployments in team namespaces without a ClusterRole escalation. The kagenti API avoids this entirely.
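The auth flow above can be sketched as follows. Only the client-credentials grant, the kagenti realm, the agentstack-api client, and the Keycloak service address come from this document; the kagenti backend's service address, endpoint path, and response shape are assumptions for illustration.

```python
# Sketch: agentstack-server discovering agents via the kagenti backend API,
# authenticating with the client-credentials grant against the kagenti realm.
import json
import urllib.parse
import urllib.request

# Keycloak address follows the provision-job target mentioned above; the
# kagenti-backend address is an assumed in-cluster Service name.
KEYCLOAK_TOKEN_URL = (
    "http://keycloak-service.keycloak:8080"
    "/realms/kagenti/protocol/openid-connect/token"
)
KAGENTI_API = "http://kagenti-backend.kagenti-system:8080"  # assumption

def token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the client-credentials token request for the kagenti realm."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    return urllib.request.Request(KEYCLOAK_TOKEN_URL, data=body, method="POST")

def list_agents(access_token: str) -> list[dict]:
    """Fetch agents visible to the kagenti-viewer role (assumed endpoint)."""
    req = urllib.request.Request(
        f"{KAGENTI_API}/agents",  # hypothetical path
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The provider service would exchange the `agentstack-api` client credentials for a token once, cache it until expiry, and call `list_agents` on each registry refresh.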

Multi-Tenancy: Global Agent Catalog + Per-User Data

All agents across all kagenti namespaces are visible to all agentstack users. User data (conversations, tasks, files) remains per-user.

Agents:        GLOBAL (all users see all agents from all namespaces)
Conversations: PER-USER (user's own chat history with agents)
Tasks:         PER-USER (A2A task ownership)
Files:         PER-USER (uploaded documents)

Multi-Agent Communication: All through agentstack proxy

Agentstack issues custom tokens and controls which agent can call which, so it stays in the request path. Cross-namespace networking works fine in K8s (no architectural blockers).

# Agent (team1) → agentstack API (agentstack namespace) ← works
# Agentstack (agentstack namespace) → agent (team1)     ← works

Namespaces are a logical boundary for resource organization, not a network boundary. Istio Ambient adds mTLS on top without restricting connectivity.
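Since agentstack stays in the request path, the proxy can enforce which agent may call which. A hypothetical sketch of such a policy check (the allowlist shape and names are illustrative, not from kagenti):

```python
# Hypothetical agent-to-agent call policy enforced by the agentstack proxy.
# Keys and values are (namespace, agent) pairs; the data here is made up.
ALLOWED_CALLS: dict[tuple[str, str], set[tuple[str, str]]] = {
    ("team1", "researcher"): {("team1", "summarizer"), ("team2", "translator")},
}

def may_call(src_ns: str, src_agent: str, dst_ns: str, dst_agent: str) -> bool:
    """True if the source agent is allowed to invoke the destination agent."""
    return (dst_ns, dst_agent) in ALLOWED_CALLS.get((src_ns, src_agent), set())
```

The proxy would consult this check after validating the custom token it issued to the calling agent, before forwarding the A2A request cross-namespace.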

DNS

Adopt kagenti's localtest.me convention (wildcard DNS → 127.0.0.1) for Keycloak redirect URIs, agent URLs, and UI access.


Component Mapping

What agentstack drops (delegates to kagenti)

  • KubernetesProviderDeploymentManager → kagenti operator deploys agents as standard K8s Deployments in team namespaces
  • KubernetesProviderBuildManager (Kaniko) → Shipwright + Tekton builds (optional)
  • Agent card in Docker labels (beeai.dev.agent.json) → real-time HTTP fetch from /.well-known/agent-card.json
  • Agent card stored in DB → real-time discovery from running agents
  • Scale-to-zero / auto-stop logic → kagenti manages agent lifecycle (or standard HPA)
  • Provider model (managed/unmanaged distinction) → all agents are kagenti-managed Deployments
  • build-provider-job.yaml (Kaniko + Crane) → Shipwright BuildRun with Buildah strategy
  • Keycloak deployment → kagenti-deps Keycloak deployment

What agentstack keeps

  • A2A Proxy Service: core routing/auth logic, user task tracking
  • Provider Registry sync: can evolve to sync with kagenti's agent namespaces
  • PostgreSQL: agentstack's own data (users, tasks, conversations)
  • Redis: caching, rate limiting
  • SeaweedFS: object storage for artifacts
  • Phoenix: LLM observability

What changes in agentstack-server

  • bootstrap.py: remove KubernetesProviderDeploymentManager and KubernetesProviderBuildManager injection
  • providers.py service: rewrite to discover agents via the kagenti API
  • a2a.py service: update URL resolution to http://{agent}.{namespace}.svc.cluster.local:8080
  • Provider model: simplify; remove auto_stop_timeout, unmanaged_state, and build fields
  • Provider cron jobs: remove auto_stop_providers and refresh_unmanaged_provider_state; keep or adapt registry sync
  • Configuration: add kagenti connection settings, remove build/scaling config

Agent Card Discovery - New Model

Current: Build-time baking into Docker labels → stored in DB → DB lookup at runtime

New: Runtime HTTP fetch from /.well-known/agent-card.json via kagenti API. No more "offline" agents. Discovery is always fresh.

Transition: Keep DB-backed card cache as fallback during PoC, add kagenti-based discovery as primary, remove DB cache once stable.
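The transition model can be sketched as below. The in-cluster URL pattern follows the a2a.py convention in the component mapping; the cache_lookup parameter is a hypothetical stand-in for the DB-backed card cache kept as a fallback during the PoC.

```python
# Sketch of runtime agent-card discovery with a DB-cache fallback (PoC phase).
import json
import urllib.request

def agent_card_url(agent: str, namespace: str, port: int = 8080) -> str:
    """Well-known agent-card endpoint of a kagenti-managed agent Service."""
    return (
        f"http://{agent}.{namespace}.svc.cluster.local:{port}"
        "/.well-known/agent-card.json"
    )

def fetch_agent_card(agent: str, namespace: str, cache_lookup=None) -> dict:
    """Fetch the card live; fall back to the cached copy if the agent is
    unreachable. cache_lookup is a hypothetical DB-backed accessor that is
    removed once kagenti-based discovery is stable."""
    try:
        url = agent_card_url(agent, namespace)
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)
    except OSError:
        if cache_lookup is not None:
            return cache_lookup(agent, namespace)
        raise
```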


MicroShift Compatibility Notes

  • Kagenti supports OpenShift via global.openshift: true
  • MicroShift may lack OLM → need Helm-based alternatives for operators
  • SPIRE's ZTWIM operator requires OCP 4.19+ → use useSpireHelmChart: true
  • Istio Ambient mode should work on MicroShift (standard K8s networking)
  • Adopt localtest.me for DNS

Implementation Plan

Phase 1: Minimal Integration

  • Install kagenti Helm charts into MicroShift (operator + webhook only)
  • Move Keycloak to separate namespace
  • Deploy a test agent via kagenti (manual kubectl)
  • Verify agent card discovery via HTTP
  • Update agentstack A2A proxy to route to kagenti-managed agents

Phase 2: Feature Parity

  • Remove KubernetesProviderDeploymentManager
  • Remove KubernetesProviderBuildManager
  • Implement kagenti-based agent discovery in provider service
  • Update Helm chart (remove Keycloak, add kagenti-deps dependency)
  • Test full agent lifecycle: deploy → discover → chat → delete

Phase 3: Optional Features

  • Enable Istio Ambient mode
  • Enable SPIRE/SPIFFE identity
  • Enable Shipwright builds
  • Configure feature toggles in values.yaml

Open Questions

  1. Kagenti operator on MicroShift: Has this been tested? Any known issues?
  2. Agent namespaces: Should agentstack create team namespaces, or delegate to kagenti?
  3. UI: Do we keep agentstack UI only, or also deploy kagenti UI for agent management?
  4. MCP Gateway: Agentstack has managed MCP service. Kagenti also has MCP Gateway. Which wins?
  5. Provider registry: Current agentstack syncs from Git-based registries. Does kagenti have an equivalent, or do we keep this?
  6. Observability: Both use Phoenix. Consolidate to one instance?

Design document: docs/poc-kagenti-integration.md on branch poc/kagenti-integration
