kagenti · pdettori · May 25, 2026 · May 21, 2026 · May 21, 2026 · May 21, 2026
@@ -55,3 +55,12 @@ kagenti-operator/config/crd/bases/_.yaml
 *.log
 debug
 Dockerfile.cross
+
+# spex: generated/local files
+.claude/skills/
+.claude/settings.json
+.claude/settings.local.json
+.specify/*
+!.specify/memory/
+.specify/memory/*
+!.specify/memory/constitution.md
@@ -0,0 +1,142 @@
+<!--
+Sync Impact Report
+- Version change: 0.0.0 → 1.0.0
+- Added principles:
+  - I. Reconciler Status Integrity
+  - II. Spec-Anchored Testing
+  - III. Controller-Runtime Safety
+  - IV. CRD-First Design
+  - V. Feature-Gated Rollout
+- Added sections:
+  - Controller-Runtime Gotchas
+  - Quality Gates
+  - Governance
+- Templates requiring updates:
+  - .specify/templates/plan-template.md: ✅ no changes needed (Constitution Check section exists)
+  - .specify/templates/spec-template.md: ✅ no changes needed (acceptance scenario format compatible)
+  - .specify/templates/tasks-template.md: ✅ no changes needed (task format compatible)
+- Follow-up TODOs: none
+-->
+
+# Kagenti Operator Constitution
+
+## Core Principles
+
+### I. Reconciler Status Integrity
+
+In-memory status mutations MUST survive all API server interactions within
+a single reconcile cycle. Any call that refreshes the reconciled object
+from the API server (Patch, Update on the main resource) MUST save and
+restore in-memory status to prevent silent data loss.
+
+Rationale: `client.Patch()` and `client.Update()` replace the local object
+with the API server response, wiping unpersisted in-memory status changes.
+This caused a production bug where `status.card` and all conditions
+disappeared despite successful card fetches. The bug passed code review
+and 180 unit tests because the Patch call failed silently in test
+environments where the object didn't exist in the API server.
+
+### II. Spec-Anchored Testing
+
+Tests MUST verify outcomes using the same method the spec's acceptance
+scenario describes. If the spec says "confirm via `kubectl get`", the test
+MUST read the object back from the API server (envtest), not inspect
+in-memory state. Tests that only check in-memory state after a function
+call may pass when API server interactions fail silently, hiding bugs that
+only manifest in production.
+
+Rationale: The card discovery bug was invisible to tests because
+`r.Patch()` returned NotFound (object not in envtest), the error was
+logged but not returned, and the in-memory state appeared correct. A test
+that read back from the API server would have caught the discrepancy
+immediately.
+
+### III. Controller-Runtime Safety
+
+All reconciler code MUST follow these controller-runtime rules:
+
+- Never call `r.Patch()` or `r.Update()` on the main resource between
+  in-memory status mutations and `Status().Update()`. If unavoidable,
+  save and restore `rt.Status` across the call.
+- `Status().Update()` only persists the status subresource. Metadata
+  changes (labels, annotations) require a separate Patch on the main
+  resource.
+- When mixing metadata patches and status updates, be aware that the
+  metadata patch refreshes the object and invalidates in-memory status.
+- HTTP fetches to in-cluster Services during reconciliation MUST have
+  timeouts (10s default) to prevent blocking the work queue.
+
+Rationale: These are controller-runtime framework behaviors that are not
+obvious from the API surface. They have caused production bugs in this
+project and are documented here to prevent recurrence.
+
+### IV. CRD-First Design
+
+CRD schemas MUST be the source of truth for the operator's data model.
+Status fields MUST use concrete types with explicit JSON tags, not
+unstructured maps. All status fields MUST be validated against the
+deployed CRD schema, not just against Go compilation.
+
+Rationale: A Kubernetes operator's contract is its CRD. Schema mismatches
+between code and CRD silently drop fields during API server round-trips.
+
+### V. Feature-Gated Rollout
+
+New controller behaviors that modify workload state or add API server
+interactions MUST be gated behind a CLI flag (disabled by default). The
+flag MUST be documented in the Helm chart values. Existing behavior MUST
+NOT change when the flag is disabled.
+
+Rationale: Operators run in shared clusters. Ungated behavior changes
+risk disrupting workloads during upgrades. Feature flags allow gradual
+rollout and easy rollback.
+
+## Controller-Runtime Gotchas
+
+This section documents framework-specific behaviors that are not obvious
+from the API surface and have caused bugs in this project. Review agents
+and developers MUST consult this section when reviewing reconciler code.
+
+1. **Patch/Update refreshes the object**: `client.Patch(ctx, obj, patch)`
+   and `client.Update(ctx, obj)` replace `obj` with the API server
+   response. Any in-memory mutations not yet persisted are lost.
+
+2. **Status is a separate subresource**: `Status().Update()` only writes
+   status fields. Metadata (annotations, labels) requires a separate
+   main-resource Patch.
+
+3. **Single worker queue**: By default, controller-runtime uses one
+   worker per controller. A blocking HTTP call (e.g., card fetch with no
+   timeout) blocks all reconciliation for that controller.
+
+4. **envtest vs production divergence**: Operations that fail silently in
+   envtest (e.g., Patch on a non-existent object) may succeed in
+   production with different side effects. Tests MUST create objects in
+   envtest before exercising code paths that interact with the API server.
+
+## Quality Gates
+
+- All reconciler tests MUST create the reconciled object in envtest and
+  read it back after the operation under test (Principle II).
+- Deep review findings that trigger auto-fixes MUST be followed by a
+  regression check: does the fix preserve all previously-passing behavior?
+- Card discovery and other HTTP-dependent features MUST be tested with
+  stub fetchers that return controlled data, AND with envtest objects
+  that trigger the full Patch/Status().Update cycle.
+
+## Governance
+
+This constitution governs all development on the kagenti-operator. It
+supersedes informal conventions and ad-hoc patterns.
+
+**Amendment process**: Propose changes via PR. Changes to principles
+require a rationale with a concrete example (bug, incident, or design
+decision) that motivated the change. Version increments follow semver:
+MAJOR for principle removals or redefinitions, MINOR for new principles
+or sections, PATCH for clarifications.
+
+**Compliance**: All PRs and code reviews MUST verify compliance with
+these principles. The deep review agents receive this constitution as
+context and MUST flag violations.
+
+**Version**: 1.0.0 | **Ratified**: 2026-05-22 | **Last Amended**: 2026-05-22
@@ -0,0 +1,106 @@
+# Brainstorm: Identity Binding Migration from AgentCard to AgentRuntime
+
+**Date:** 2026-05-21
+**Status:** active
+**Triggered by:** PR #372 review feedback (pdettori)
+
+## Problem Framing
+
+AgentCard's `spec.identityBinding` is the only admin-authored policy field on the AgentCard CRD. It controls two things:
+
+1. **Trust domain scoping** (`trustDomain`): overrides the operator-level `--spire-trust-domain` for a specific agent workload. The controller checks whether the agent's SPIFFE ID (from JWS signature or mTLS peer cert) belongs to this trust domain.
+
+2. **Strict enforcement** (`strict`): when true, a binding failure removes the `signature-verified` label from the workload, triggering the NetworkPolicy controller to apply a restrictive policy that isolates the agent.
+
+Identity binding is orthogonal to card discovery. The card fetch is the mechanism that surfaces the SPIFFE ID, but the binding evaluation and enforcement are about the workload's identity posture, not the card content. Moving card data into AgentRuntime status does not affect binding behavior.
+
+## Current State
+
+```
+AgentCard.spec.identityBinding
+├── trustDomain: string (per-agent override of --spire-trust-domain)
+└── strict: bool (default: false)
+    ├── false → binding results recorded in status only, no enforcement
+    └── true → binding failure removes signature-verified label
+                → NetworkPolicy controller applies restrictive policy
+```
+
+AgentRuntime already has a related field:
+
+```
+AgentRuntime.spec.identity.spiffe.trustDomain
+```
+
+This field serves a different purpose today (configuring the workload's own SVID trust domain for injection), but the naming and placement overlap with identity binding's trust domain scoping.
+
+## Migration Path
+
+### Phase 1: Card data into AgentRuntime status (this PR, #372)
+
+- `status.card` surfaces card data, fetch metadata, and verification results
+- Identity binding stays on AgentCard
+- AgentCard controller continues all enforcement (label propagation, NetworkPolicy)
+- No behavior changes for existing identity binding users
+
+### Phase 2: Identity binding into AgentRuntime spec (future PR)
+
+Move `identityBinding` to AgentRuntime.spec, alongside the existing identity fields:
+
+```yaml
+apiVersion: agent.kagenti.dev/v1alpha1
+kind: AgentRuntime
+spec:
+  identity:
+    spiffe:
+      trustDomain: example.org    # existing: workload SVID trust domain
+    binding:                       # new: migrated from AgentCard
+      trustDomain: example.org     # override for binding evaluation
+      strict: false                # enforcement toggle
+```
+
+Open question: should `identity.spiffe.trustDomain` and `identity.binding.trustDomain` be unified? They serve related but different purposes (SVID injection vs binding evaluation). If unified, a single `trustDomain` at the `identity` level could serve both.
+
+### Phase 3: Enforcement migration (future PR)
+
+Move label propagation logic (`signature-verified` label) from AgentCard controller to AgentRuntime controller. The AgentRuntime controller already manages workload labels and annotations, so this is a natural fit.
+
+### Phase 4: AgentCard CRD removal
+
+Once identity binding and enforcement have migrated:
+- Remove AgentCard CRD
+- Remove AgentCardSyncReconciler (auto-creates AgentCards for labelled workloads)
+- Remove AgentCard controller
+- Clean up RBAC, webhooks, and test fixtures
+
+## Design Considerations
+
+### Enforcement during coexistence
+
+During the transition (Phases 1-2), enforcement continues via the AgentCard controller. Both CRDs coexist. Operators who use identity binding today see no behavior change.
+
+After Phase 2, the AgentRuntime controller evaluates binding from its own spec fields. The AgentCard controller's binding logic becomes dead code and is removed in Phase 4.
+
+### AgentCardSyncReconciler
+
+The sync controller auto-creates AgentCards for workloads with `kagenti.io/type=agent` labels. During coexistence, it continues to function. After Phase 2, the sync controller is no longer needed because:
+- Card discovery is handled by AgentRuntime controller (Phase 1)
+- Identity binding is on AgentRuntime spec (Phase 2)
+- There is no remaining reason to auto-create AgentCards
+
+### Trust domain field unification
+
+Three options:
+
+| Option | Structure | Pros | Cons |
+|--------|-----------|------|------|
+| A: Separate fields | `identity.spiffe.trustDomain` + `identity.binding.trustDomain` | Clear separation of concerns | Confusing duplication |
+| B: Unified field | `identity.trustDomain` (serves both) | Simple, one source of truth | Loses granularity if they diverge |
+| C: Binding inherits | `identity.binding.trustDomain` defaults to `identity.spiffe.trustDomain` if unset | Best of both, explicit override | Slightly more complex defaulting |
+
+Option C seems best: the common case is a single trust domain, but the override is available when needed.
+
+## Open Questions
+
+- Should binding evaluation in the AgentRuntime controller use the verification data already in `status.card` (from Phase 1), or should it re-evaluate independently?
+- Is there a need for namespace-level binding policy (via ConfigMap), analogous to `authbridge-runtime-config`?
+- Should the `AgentCardSyncReconciler` get its own deprecation warning before removal?