Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -55,3 +55,12 @@ kagenti-operator/config/crd/bases/_.yaml
*.log
debug
Dockerfile.cross

# spex: generated/local files
.claude/skills/
.claude/settings.json
.claude/settings.local.json
.specify/*
!.specify/memory/
.specify/memory/*
!.specify/memory/constitution.md
142 changes: 142 additions & 0 deletions .specify/memory/constitution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
<!--
Sync Impact Report
- Version change: 0.0.0 → 1.0.0
- Added principles:
- I. Reconciler Status Integrity
- II. Spec-Anchored Testing
- III. Controller-Runtime Safety
- IV. CRD-First Design
- V. Feature-Gated Rollout
- Added sections:
- Controller-Runtime Gotchas
- Quality Gates
- Governance
- Templates requiring updates:
- .specify/templates/plan-template.md: ✅ no changes needed (Constitution Check section exists)
- .specify/templates/spec-template.md: ✅ no changes needed (acceptance scenario format compatible)
- .specify/templates/tasks-template.md: ✅ no changes needed (task format compatible)
- Follow-up TODOs: none
-->

# Kagenti Operator Constitution

## Core Principles

### I. Reconciler Status Integrity

In-memory status mutations MUST survive all API server interactions within
a single reconcile cycle. Any call that refreshes the reconciled object
from the API server (Patch, Update on the main resource) MUST save and
restore in-memory status to prevent silent data loss.

Rationale: `client.Patch()` and `client.Update()` replace the local object
with the API server response, wiping unpersisted in-memory status changes.
This caused a production bug where `status.card` and all conditions
disappeared despite successful card fetches. The bug passed code review
and 180 unit tests because the Patch call failed silently in test
environments where the object didn't exist in the API server.

### II. Spec-Anchored Testing

Tests MUST verify outcomes using the same method the spec's acceptance
scenario describes. If the spec says "confirm via `kubectl get`", the test
MUST read the object back from the API server (envtest), not inspect
in-memory state. Tests that only check in-memory state after a function
call may pass when API server interactions fail silently, hiding bugs that
only manifest in production.

Rationale: The card discovery bug was invisible to tests because
`r.Patch()` returned NotFound (object not in envtest), the error was
logged but not returned, and the in-memory state appeared correct. A test
that read back from the API server would have caught the discrepancy
immediately.

### III. Controller-Runtime Safety

All reconciler code MUST follow these controller-runtime rules:

- Never call `r.Patch()` or `r.Update()` on the main resource between
in-memory status mutations and `Status().Update()`. If unavoidable,
save and restore `rt.Status` across the call.
- `Status().Update()` only persists the status subresource. Metadata
changes (labels, annotations) require a separate Patch on the main
resource.
- When mixing metadata patches and status updates, be aware that the
metadata patch refreshes the object and invalidates in-memory status.
- HTTP fetches to in-cluster Services during reconciliation MUST have
timeouts (10s default) to prevent blocking the work queue.

Rationale: These are controller-runtime framework behaviors that are not
obvious from the API surface. They have caused production bugs in this
project and are documented here to prevent recurrence.

### IV. CRD-First Design

CRD schemas MUST be the source of truth for the operator's data model.
Status fields MUST use concrete types with explicit JSON tags, not
unstructured maps. All status fields MUST be validated against the
deployed CRD schema, not just against Go compilation.

Rationale: A Kubernetes operator's contract is its CRD. Schema mismatches
between code and CRD silently drop fields during API server round-trips.

### V. Feature-Gated Rollout

New controller behaviors that modify workload state or add API server
interactions MUST be gated behind a CLI flag (disabled by default). The
flag MUST be documented in the Helm chart values. Existing behavior MUST
NOT change when the flag is disabled.

Rationale: Operators run in shared clusters. Ungated behavior changes
risk disrupting workloads during upgrades. Feature flags allow gradual
rollout and easy rollback.

## Controller-Runtime Gotchas

This section documents framework-specific behaviors that are not obvious
from the API surface and have caused bugs in this project. Review agents
and developers MUST consult this section when reviewing reconciler code.

1. **Patch/Update refreshes the object**: `client.Patch(ctx, obj, patch)`
and `client.Update(ctx, obj)` replace `obj` with the API server
response. Any in-memory mutations not yet persisted are lost.

2. **Status is a separate subresource**: `Status().Update()` only writes
status fields. Metadata (annotations, labels) requires a separate
main-resource Patch.

3. **Single worker queue**: By default, controller-runtime uses one
worker per controller. A blocking HTTP call (e.g., card fetch with no
timeout) blocks all reconciliation for that controller.

4. **envtest vs production divergence**: Operations that fail silently in
envtest (e.g., Patch on a non-existent object) may succeed in
production with different side effects. Tests MUST create objects in
envtest before exercising code paths that interact with the API server.

## Quality Gates

- All reconciler tests MUST create the reconciled object in envtest and
read it back after the operation under test (Principle II).
- Deep review findings that trigger auto-fixes MUST be followed by a
regression check: does the fix preserve all previously-passing behavior?
- Card discovery and other HTTP-dependent features MUST be tested with
stub fetchers that return controlled data, AND with envtest objects
that trigger the full Patch/Status().Update cycle.

## Governance

This constitution governs all development on the kagenti-operator. It
supersedes informal conventions and ad-hoc patterns.

**Amendment process**: Propose changes via PR. Changes to principles
require a rationale with a concrete example (bug, incident, or design
decision) that motivated the change. Version increments follow semver:
MAJOR for principle removals or redefinitions, MINOR for new principles
or sections, PATCH for clarifications.

**Compliance**: All PRs and code reviews MUST verify compliance with
these principles. The deep review agents receive this constitution as
context and MUST flag violations.

**Version**: 1.0.0 | **Ratified**: 2026-05-22 | **Last Amended**: 2026-05-22
106 changes: 106 additions & 0 deletions brainstorm/02-identity-binding-migration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# Brainstorm: Identity Binding Migration from AgentCard to AgentRuntime

**Date:** 2026-05-21
**Status:** active
**Triggered by:** PR #372 review feedback (pdettori)

## Problem Framing

AgentCard's `spec.identityBinding` is the only admin-authored policy field on the AgentCard CRD. It controls two things:

1. **Trust domain scoping** (`trustDomain`): overrides the operator-level `--spire-trust-domain` for a specific agent workload. The controller checks whether the agent's SPIFFE ID (from JWS signature or mTLS peer cert) belongs to this trust domain.

2. **Strict enforcement** (`strict`): when true, a binding failure removes the `signature-verified` label from the workload, triggering the NetworkPolicy controller to apply a restrictive policy that isolates the agent.

Identity binding is orthogonal to card discovery. The card fetch is the mechanism that surfaces the SPIFFE ID, but the binding evaluation and enforcement are about the workload's identity posture, not the card content. Moving card data into AgentRuntime status does not affect binding behavior.

## Current State

```
AgentCard.spec.identityBinding
├── trustDomain: string (per-agent override of --spire-trust-domain)
└── strict: bool (default: false)
├── false → binding results recorded in status only, no enforcement
└── true → binding failure removes signature-verified label
→ NetworkPolicy controller applies restrictive policy
```

AgentRuntime already has a related field:

```
AgentRuntime.spec.identity.spiffe.trustDomain
```

This field serves a different purpose today (configuring the workload's own SVID trust domain for injection), but the naming and placement overlap with identity binding's trust domain scoping.

## Migration Path

### Phase 1: Card data into AgentRuntime status (this PR, #372)

- `status.card` surfaces card data, fetch metadata, and verification results
- Identity binding stays on AgentCard
- AgentCard controller continues all enforcement (label propagation, NetworkPolicy)
- No behavior changes for existing identity binding users

### Phase 2: Identity binding into AgentRuntime spec (future PR)

Move `identityBinding` to AgentRuntime.spec, alongside the existing identity fields:

```yaml
apiVersion: agent.kagenti.dev/v1alpha1
kind: AgentRuntime
spec:
identity:
spiffe:
trustDomain: example.org # existing: workload SVID trust domain
binding: # new: migrated from AgentCard
trustDomain: example.org # override for binding evaluation
strict: false # enforcement toggle
```

Open question: should `identity.spiffe.trustDomain` and `identity.binding.trustDomain` be unified? They serve related but different purposes (SVID injection vs binding evaluation). If unified, a single `trustDomain` at the `identity` level could serve both.

### Phase 3: Enforcement migration (future PR)

Move label propagation logic (`signature-verified` label) from AgentCard controller to AgentRuntime controller. The AgentRuntime controller already manages workload labels and annotations, so this is a natural fit.

### Phase 4: AgentCard CRD removal

Once identity binding and enforcement have migrated:
- Remove AgentCard CRD
- Remove AgentCardSyncReconciler (auto-creates AgentCards for labelled workloads)
- Remove AgentCard controller
- Clean up RBAC, webhooks, and test fixtures

## Design Considerations

### Enforcement during coexistence

During the transition (Phases 1-2), enforcement continues via the AgentCard controller. Both CRDs coexist. Operators who use identity binding today see no behavior change.

After Phase 2, the AgentRuntime controller evaluates binding from its own spec fields. The AgentCard controller's binding logic becomes dead code and is removed in Phase 4.

### AgentCardSyncReconciler

The sync controller auto-creates AgentCards for workloads with `kagenti.io/type=agent` labels. During coexistence, it continues to function. After Phase 2, the sync controller is no longer needed because:
- Card discovery is handled by AgentRuntime controller (Phase 1)
- Identity binding is on AgentRuntime spec (Phase 2)
- There is no remaining reason to auto-create AgentCards

### Trust domain field unification

Three options:

| Option | Structure | Pros | Cons |
|--------|-----------|------|------|
| A: Separate fields | `identity.spiffe.trustDomain` + `identity.binding.trustDomain` | Clear separation of concerns | Confusing duplication |
| B: Unified field | `identity.trustDomain` (serves both) | Simple, one source of truth | Loses granularity if they diverge |
| C: Binding inherits | `identity.binding.trustDomain` defaults to `identity.spiffe.trustDomain` if unset | Best of both, explicit override | Slightly more complex defaulting |

Option C seems best: the common case is a single trust domain, but the override is available when needed.

## Open Questions

- Should binding evaluation in the AgentRuntime controller use the verification data already in `status.card` (from Phase 1), or should it re-evaluate independently?
- Is there a need for namespace-level binding policy (via ConfigMap), analogous to `authbridge-runtime-config`?
- Should the `AgentCardSyncReconciler` get its own deprecation warning before removal?
Loading
Loading