diff --git a/.gitignore b/.gitignore index d1bcaedd..eecd4e0e 100644 --- a/.gitignore +++ b/.gitignore @@ -55,3 +55,12 @@ kagenti-operator/config/crd/bases/_.yaml *.log debug Dockerfile.cross + +# spex: generated/local files +.claude/skills/ +.claude/settings.json +.claude/settings.local.json +.specify/* +!.specify/memory/ +.specify/memory/* +!.specify/memory/constitution.md diff --git a/.specify/memory/constitution.md b/.specify/memory/constitution.md new file mode 100644 index 00000000..41e2dc36 --- /dev/null +++ b/.specify/memory/constitution.md @@ -0,0 +1,142 @@ + + +# Kagenti Operator Constitution + +## Core Principles + +### I. Reconciler Status Integrity + +In-memory status mutations MUST survive all API server interactions within +a single reconcile cycle. Any call that refreshes the reconciled object +from the API server (Patch, Update on the main resource) MUST save and +restore in-memory status to prevent silent data loss. + +Rationale: `client.Patch()` and `client.Update()` replace the local object +with the API server response, wiping unpersisted in-memory status changes. +This caused a production bug where `status.card` and all conditions +disappeared despite successful card fetches. The bug passed code review +and 180 unit tests because the Patch call failed silently in test +environments where the object didn't exist in the API server. + +### II. Spec-Anchored Testing + +Tests MUST verify outcomes using the same method the spec's acceptance +scenario describes. If the spec says "confirm via `kubectl get`", the test +MUST read the object back from the API server (envtest), not inspect +in-memory state. Tests that only check in-memory state after a function +call may pass when API server interactions fail silently, hiding bugs that +only manifest in production. 
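As a rough illustration of why Principle II asks for a read-back, the sketch below reproduces the failure mode with stand-in types. Everything here is hypothetical (`fakeStore`, `agentRuntime`, `reconcile`, and the `"CardSynced"` status string are illustrative names, not the operator's code); it only assumes the behavior described above, namely that persistence fails with NotFound and the error is logged but not returned.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound error (assumption for this sketch).
var errNotFound = errors.New("not found")

// fakeStore is a hypothetical stand-in for the API server, keyed by object name.
type fakeStore map[string]string

// persist writes status only if the object already exists, mimicking a
// Patch against an object that was never created.
func (s fakeStore) persist(name, status string) error {
	if _, ok := s[name]; !ok {
		return errNotFound
	}
	s[name] = status
	return nil
}

// agentRuntime is a simplified object carrying in-memory status.
type agentRuntime struct {
	Name   string
	Status string
}

// reconcile mutates in-memory status and swallows the persistence error,
// which is the silent-failure pattern the principle guards against.
func reconcile(s fakeStore, rt *agentRuntime) {
	rt.Status = "CardSynced"
	if err := s.persist(rt.Name, rt.Status); err != nil {
		fmt.Println("logged, not returned:", err)
	}
}

func main() {
	store := fakeStore{} // the object was never created in the store
	rt := &agentRuntime{Name: "demo"}
	reconcile(store, rt)

	// An in-memory assertion passes even though nothing was persisted.
	fmt.Println("in-memory check passes:", rt.Status == "CardSynced")

	// A read-back assertion exposes the discrepancy.
	_, persisted := store[rt.Name]
	fmt.Println("read-back check passes:", persisted)
}
```

In a real test the read-back would presumably be a `Get` against envtest rather than a map lookup; the map is only there to keep the sketch self-contained.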
+ +Rationale: The card discovery bug was invisible to tests because +`r.Patch()` returned NotFound (object not in envtest), the error was +logged but not returned, and the in-memory state appeared correct. A test +that read back from the API server would have caught the discrepancy +immediately. + +### III. Controller-Runtime Safety + +All reconciler code MUST follow these controller-runtime rules: + +- Never call `r.Patch()` or `r.Update()` on the main resource between + in-memory status mutations and `Status().Update()`. If unavoidable, + save and restore `rt.Status` across the call. +- `Status().Update()` only persists the status subresource. Metadata + changes (labels, annotations) require a separate Patch on the main + resource. +- When mixing metadata patches and status updates, be aware that the + metadata patch refreshes the object and invalidates in-memory status. +- HTTP fetches to in-cluster Services during reconciliation MUST have + timeouts (10s default) to prevent blocking the work queue. + +Rationale: These are controller-runtime framework behaviors that are not +obvious from the API surface. They have caused production bugs in this +project and are documented here to prevent recurrence. + +### IV. CRD-First Design + +CRD schemas MUST be the source of truth for the operator's data model. +Status fields MUST use concrete types with explicit JSON tags, not +unstructured maps. All status fields MUST be validated against the +deployed CRD schema, not just against Go compilation. + +Rationale: A Kubernetes operator's contract is its CRD. Schema mismatches +between code and CRD silently drop fields during API server round-trips. + +### V. Feature-Gated Rollout + +New controller behaviors that modify workload state or add API server +interactions MUST be gated behind a CLI flag (disabled by default). The +flag MUST be documented in the Helm chart values. Existing behavior MUST +NOT change when the flag is disabled. 
+ +Rationale: Operators run in shared clusters. Ungated behavior changes +risk disrupting workloads during upgrades. Feature flags allow gradual +rollout and easy rollback. + +## Controller-Runtime Gotchas + +This section documents framework-specific behaviors that are not obvious +from the API surface and have caused bugs in this project. Review agents +and developers MUST consult this section when reviewing reconciler code. + +1. **Patch/Update refreshes the object**: `client.Patch(ctx, obj, patch)` + and `client.Update(ctx, obj)` replace `obj` with the API server + response. Any in-memory mutations not yet persisted are lost. + +2. **Status is a separate subresource**: `Status().Update()` only writes + status fields. Metadata (annotations, labels) requires a separate + main-resource Patch. + +3. **Single worker queue**: By default, controller-runtime uses one + worker per controller. A blocking HTTP call (e.g., card fetch with no + timeout) blocks all reconciliation for that controller. + +4. **envtest vs production divergence**: Operations that fail silently in + envtest (e.g., Patch on a non-existent object) may succeed in + production with different side effects. Tests MUST create objects in + envtest before exercising code paths that interact with the API server. + +## Quality Gates + +- All reconciler tests MUST create the reconciled object in envtest and + read it back after the operation under test (Principle II). +- Deep review findings that trigger auto-fixes MUST be followed by a + regression check: does the fix preserve all previously-passing behavior? +- Card discovery and other HTTP-dependent features MUST be tested with + stub fetchers that return controlled data, AND with envtest objects + that trigger the full Patch/Status().Update cycle. + +## Governance + +This constitution governs all development on the kagenti-operator. It +supersedes informal conventions and ad-hoc patterns. + +**Amendment process**: Propose changes via PR. 
Changes to principles +require a rationale with a concrete example (bug, incident, or design +decision) that motivated the change. Version increments follow semver: +MAJOR for principle removals or redefinitions, MINOR for new principles +or sections, PATCH for clarifications. + +**Compliance**: All PRs and code reviews MUST verify compliance with +these principles. The deep review agents receive this constitution as +context and MUST flag violations. + +**Version**: 1.0.0 | **Ratified**: 2026-05-22 | **Last Amended**: 2026-05-22 diff --git a/brainstorm/02-identity-binding-migration.md b/brainstorm/02-identity-binding-migration.md new file mode 100644 index 00000000..f20035a6 --- /dev/null +++ b/brainstorm/02-identity-binding-migration.md @@ -0,0 +1,106 @@ +# Brainstorm: Identity Binding Migration from AgentCard to AgentRuntime + +**Date:** 2026-05-21 +**Status:** active +**Triggered by:** PR #372 review feedback (pdettori) + +## Problem Framing + +AgentCard's `spec.identityBinding` is the only admin-authored policy field on the AgentCard CRD. It controls two things: + +1. **Trust domain scoping** (`trustDomain`): overrides the operator-level `--spire-trust-domain` for a specific agent workload. The controller checks whether the agent's SPIFFE ID (from JWS signature or mTLS peer cert) belongs to this trust domain. + +2. **Strict enforcement** (`strict`): when true, a binding failure removes the `signature-verified` label from the workload, triggering the NetworkPolicy controller to apply a restrictive policy that isolates the agent. + +Identity binding is orthogonal to card discovery. The card fetch is the mechanism that surfaces the SPIFFE ID, but the binding evaluation and enforcement are about the workload's identity posture, not the card content. Moving card data into AgentRuntime status does not affect binding behavior. 
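The two controls described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AgentCard controller's implementation: `inTrustDomain`, `enforce`, and the label key `example.com/signature-verified` are hypothetical names (the real label key is not spelled out in this document), and the trust domain check simply compares the host part of a `spiffe://` URI.

```go
package main

import (
	"fmt"
	"net/url"
)

// labelSignatureVerified is a hypothetical key used only for this sketch.
const labelSignatureVerified = "example.com/signature-verified"

// inTrustDomain reports whether a SPIFFE ID of the form
// spiffe://<trust-domain>/<path> belongs to the given trust domain.
func inTrustDomain(spiffeID, trustDomain string) bool {
	u, err := url.Parse(spiffeID)
	if err != nil {
		return false
	}
	return u.Scheme == "spiffe" && u.Host == trustDomain
}

// enforce returns a copy of the workload labels. The verified label is
// removed only when binding failed and strict mode is on; otherwise the
// labels are returned unchanged and the result is recorded elsewhere.
func enforce(labels map[string]string, bound, strict bool) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	if !bound && strict {
		delete(out, labelSignatureVerified)
	}
	return out
}

func main() {
	labels := map[string]string{labelSignatureVerified: "true"}
	bound := inTrustDomain("spiffe://other.org/ns/default/sa/agent", "example.org")

	fmt.Println("bound:", bound)
	fmt.Println("strict=false labels:", len(enforce(labels, bound, false)))
	fmt.Println("strict=true labels:", len(enforce(labels, bound, true)))
}
```

Keeping the membership check and the enforcement step as separate functions mirrors the point made above: evaluation concerns the workload's identity, while enforcement is a policy toggle layered on top of the result.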
+ +## Current State + +``` +AgentCard.spec.identityBinding +├── trustDomain: string (per-agent override of --spire-trust-domain) +└── strict: bool (default: false) + ├── false → binding results recorded in status only, no enforcement + └── true → binding failure removes signature-verified label + → NetworkPolicy controller applies restrictive policy +``` + +AgentRuntime already has a related field: + +``` +AgentRuntime.spec.identity.spiffe.trustDomain +``` + +This field serves a different purpose today (configuring the workload's own SVID trust domain for injection), but the naming and placement overlap with identity binding's trust domain scoping. + +## Migration Path + +### Phase 1: Card data into AgentRuntime status (this PR, #372) + +- `status.card` surfaces card data, fetch metadata, and verification results +- Identity binding stays on AgentCard +- AgentCard controller continues all enforcement (label propagation, NetworkPolicy) +- No behavior changes for existing identity binding users + +### Phase 2: Identity binding into AgentRuntime spec (future PR) + +Move `identityBinding` to AgentRuntime.spec, alongside the existing identity fields: + +```yaml +apiVersion: agent.kagenti.dev/v1alpha1 +kind: AgentRuntime +spec: + identity: + spiffe: + trustDomain: example.org # existing: workload SVID trust domain + binding: # new: migrated from AgentCard + trustDomain: example.org # override for binding evaluation + strict: false # enforcement toggle +``` + +Open question: should `identity.spiffe.trustDomain` and `identity.binding.trustDomain` be unified? They serve related but different purposes (SVID injection vs binding evaluation). If unified, a single `trustDomain` at the `identity` level could serve both. + +### Phase 3: Enforcement migration (future PR) + +Move label propagation logic (`signature-verified` label) from AgentCard controller to AgentRuntime controller. 
The AgentRuntime controller already manages workload labels and annotations, so this is a natural fit. + +### Phase 4: AgentCard CRD removal + +Once identity binding and enforcement have migrated: +- Remove AgentCard CRD +- Remove AgentCardSyncReconciler (auto-creates AgentCards for labelled workloads) +- Remove AgentCard controller +- Clean up RBAC, webhooks, and test fixtures + +## Design Considerations + +### Enforcement during coexistence + +During the transition (Phases 1-2), enforcement continues via the AgentCard controller. Both CRDs coexist. Operators who use identity binding today see no behavior change. + +After Phase 2, the AgentRuntime controller evaluates binding from its own spec fields. The AgentCard controller's binding logic becomes dead code and is removed in Phase 4. + +### AgentCardSyncReconciler + +The sync controller auto-creates AgentCards for workloads with `kagenti.io/type=agent` labels. During coexistence, it continues to function. After Phase 2, the sync controller is no longer needed because: +- Card discovery is handled by AgentRuntime controller (Phase 1) +- Identity binding is on AgentRuntime spec (Phase 2) +- There is no remaining reason to auto-create AgentCards + +### Trust domain field unification + +Three options: + +| Option | Structure | Pros | Cons | +|--------|-----------|------|------| +| A: Separate fields | `identity.spiffe.trustDomain` + `identity.binding.trustDomain` | Clear separation of concerns | Confusing duplication | +| B: Unified field | `identity.trustDomain` (serves both) | Simple, one source of truth | Loses granularity if they diverge | +| C: Binding inherits | `identity.binding.trustDomain` defaults to `identity.spiffe.trustDomain` if unset | Best of both, explicit override | Slightly more complex defaulting | + +Option C seems best: the common case is a single trust domain, but the override is available when needed. 
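A minimal sketch of how Option C's defaulting might look, assuming simplified stand-ins for the proposed spec fields. The type names and `effectiveBindingTrustDomain` are hypothetical, and the final fallback to the operator-level `--spire-trust-domain` value is an assumption carried over from how `trustDomain` is described in the problem framing, not something Option C states explicitly.

```go
package main

import "fmt"

// Hypothetical shapes mirroring the proposed YAML; not the operator's API types.
type spiffeIdentity struct {
	TrustDomain string
}

type bindingSpec struct {
	TrustDomain string
	Strict      bool
}

type identitySpec struct {
	Spiffe  *spiffeIdentity
	Binding *bindingSpec
}

// effectiveBindingTrustDomain resolves the trust domain used for binding
// evaluation: explicit binding override first, then the SVID trust domain,
// then the operator-level default (the last step is an assumption).
func effectiveBindingTrustDomain(id identitySpec, operatorDefault string) string {
	if id.Binding != nil && id.Binding.TrustDomain != "" {
		return id.Binding.TrustDomain
	}
	if id.Spiffe != nil && id.Spiffe.TrustDomain != "" {
		return id.Spiffe.TrustDomain
	}
	return operatorDefault
}

func main() {
	common := identitySpec{Spiffe: &spiffeIdentity{TrustDomain: "example.org"}}
	override := identitySpec{
		Spiffe:  &spiffeIdentity{TrustDomain: "example.org"},
		Binding: &bindingSpec{TrustDomain: "partner.example.org"},
	}

	fmt.Println(effectiveBindingTrustDomain(common, "cluster.local"))
	fmt.Println(effectiveBindingTrustDomain(override, "cluster.local"))
	fmt.Println(effectiveBindingTrustDomain(identitySpec{}, "cluster.local"))
}
```

Resolving the value in one function would keep the "slightly more complex defaulting" noted in the table in a single place, whether that logic ends up in the controller or in a defaulting webhook.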
+ +## Open Questions + +- Should binding evaluation in the AgentRuntime controller use the verification data already in `status.card` (from Phase 1), or should it re-evaluate independently? +- Is there a need for namespace-level binding policy (via ConfigMap), analogous to `authbridge-runtime-config`? +- Should the `AgentCardSyncReconciler` get its own deprecation warning before removal? diff --git a/charts/kagenti-operator/crds/agent.kagenti.dev_agentruntimes.yaml b/charts/kagenti-operator/crds/agent.kagenti.dev_agentruntimes.yaml index 8d5fcb6b..ff811c0b 100644 --- a/charts/kagenti-operator/crds/agent.kagenti.dev_agentruntimes.yaml +++ b/charts/kagenti-operator/crds/agent.kagenti.dev_agentruntimes.yaml @@ -30,6 +30,11 @@ spec: jsonPath: .status.phase name: Phase type: string + - description: Card Sync Status + jsonPath: .status.conditions[?(@.type=='CardSynced')].status + name: CardSynced + priority: 1 + type: string - jsonPath: .metadata.creationTimestamp name: Age type: date @@ -131,6 +136,13 @@ spec: mtls.mode > "disabled". Setting mtlsMode != disabled implicitly requires SPIRE — the operator auto-enables spire for the workload. + CR-empty vs CR="disabled" are observably different in + `kubectl get agentruntime -o yaml` (the former omits the field, + the latter shows mtlsMode: disabled) but produce the same + effective mode: empty falls through to the namespace ConfigMap, + "disabled" is an explicit override that pins mode off even when + the namespace default is non-disabled. + Note: changing mtlsMode triggers a pod rollout because authbridge cannot hot-reload mTLS config (the byte-peek listener is wired at process start). @@ -198,6 +210,220 @@ spec: status: description: AgentRuntimeStatus defines the observed state of AgentRuntime. properties: + card: + description: Card holds A2A agent card data discovered from the workload's + Service endpoint. 
+ properties: + attestedAgentSpiffeID: + description: AttestedAgentSpiffeID is the SPIFFE ID extracted + from the mTLS peer certificate. + type: string + capabilities: + description: The A2A capability set supported by the agent. + properties: + extensions: + description: A list of protocol extensions supported by the + agent. + items: + description: AgentExtension describes an A2A protocol extension + supported by the agent. + properties: + description: + description: A human-readable description of how this + agent uses the extension. + type: string + params: + additionalProperties: + x-kubernetes-preserve-unknown-fields: true + description: Extension-specific configuration parameters. + type: object + required: + description: |- + If true, the client must understand and comply with the extension's + requirements. + type: boolean + uri: + description: The unique URI identifying the extension. + type: string + type: object + type: array + pushNotifications: + description: |- + Indicates if the agent supports sending push notifications for + asynchronous task updates. + type: boolean + streaming: + description: Indicates if the agent supports streaming responses. + type: boolean + type: object + cardId: + description: CardID is a SHA-256 content hash of the fetched card + data. + type: string + defaultInputModes: + description: |- + The set of interaction modes that the agent supports across all skills, + defined as media types. + items: + type: string + type: array + defaultOutputModes: + description: The media types supported as outputs from this agent. + items: + type: string + type: array + description: + description: |- + A human-readable description of the agent, assisting users and other + agents in understanding its purpose. + type: string + documentationUrl: + description: A URL providing additional documentation about the + agent. + type: string + fetchedAt: + description: FetchedAt is the timestamp of the last successful + card fetch. 
+ format: date-time + type: string + iconUrl: + description: A URL to an icon for the agent. + type: string + name: + description: A human-readable name for the agent. + type: string + protocol: + description: Protocol is the detected agent protocol (e.g., "a2a"). + type: string + provider: + description: The service provider of the agent. + properties: + organization: + description: The name of the agent provider's organization. + type: string + url: + description: A URL for the agent provider's website or relevant + documentation. + type: string + type: object + signatureKeyID: + description: SignatureKeyID is the key ID from the verified JWS + header. + type: string + signatureVerificationDetails: + description: SignatureVerificationDetails contains details or + errors from signature verification. + type: string + signatures: + description: JWS signatures per A2A spec §8.4.2. + items: + description: AgentCardSignature represents a JWS signature on + an AgentCard (A2A spec §8.4.2). + properties: + header: + description: Header contains optional unprotected JWS header + parameters. + properties: + timestamp: + description: Timestamp is when the signature was created + (ISO 8601 string) + type: string + type: object + protected: + description: Protected is the base64url-encoded JWS protected + header (contains alg, kid, x5c). + type: string + signature: + description: Signature is the base64url-encoded JWS signature + value. + type: string + required: + - protected + - signature + type: object + type: array + skills: + description: |- + Skills represent the abilities of an agent. A skill is a focused set of + behaviors that the agent is likely to succeed at. + items: + description: AgentSkill represents a skill offered by the agent. + properties: + description: + description: A detailed description of the skill. + type: string + examples: + description: Example prompts or scenarios that this skill + can handle. 
+ items: + type: string + type: array + id: + description: A unique identifier for the agent's skill. + type: string + inputModes: + description: |- + The set of supported input media types for this skill, overriding the + agent's defaults. + items: + type: string + type: array + name: + description: A human-readable name for the skill. + type: string + outputModes: + description: |- + The set of supported output media types for this skill, overriding the + agent's defaults. + items: + type: string + type: array + parameters: + description: The parameters accepted by this skill. + items: + description: SkillParameter defines a parameter accepted + by a skill. + properties: + default: + description: The default value for this parameter. + type: string + description: + description: A human-readable description of the parameter. + type: string + name: + description: The name of the parameter. + type: string + required: + description: Indicates if this parameter must be provided. + type: boolean + type: + description: The type of the parameter (e.g., "string", + "number", "boolean", "object", "array"). + type: string + type: object + type: array + tags: + description: A set of keywords describing the skill's capabilities. + items: + type: string + type: array + type: object + type: array + supportsAuthenticatedExtendedCard: + description: |- + Indicates if the agent supports providing an extended agent card when + authenticated. + type: boolean + url: + description: The URL of the agent's endpoint. + type: string + validSignature: + description: ValidSignature is the result of JWS signature verification. + type: boolean + version: + description: The version of the agent. 
+ type: string + type: object conditions: description: Conditions represent the current state of the AgentRuntime items: diff --git a/kagenti-operator/api/v1alpha1/agentruntime_types.go b/kagenti-operator/api/v1alpha1/agentruntime_types.go index 7aa1d864..9c81cc1d 100644 --- a/kagenti-operator/api/v1alpha1/agentruntime_types.go +++ b/kagenti-operator/api/v1alpha1/agentruntime_types.go @@ -167,6 +167,41 @@ type SamplingSpec struct { Rate float64 `json:"rate"` } +// CardStatus holds the fetched A2A agent card data along with fetch metadata +// and optional verification results. Populated by the card discovery phase when +// --enable-card-discovery is set. +type CardStatus struct { + AgentCardData `json:",inline"` + + // FetchedAt is the timestamp of the last successful card fetch. + // +optional + FetchedAt *metav1.Time `json:"fetchedAt,omitempty"` + + // CardID is a SHA-256 content hash of the fetched card data. + // +optional + CardID string `json:"cardId,omitempty"` + + // Protocol is the detected agent protocol (e.g., "a2a"). + // +optional + Protocol string `json:"protocol,omitempty"` + + // ValidSignature is the result of JWS signature verification. + // +optional + ValidSignature *bool `json:"validSignature,omitempty"` + + // SignatureKeyID is the key ID from the verified JWS header. + // +optional + SignatureKeyID string `json:"signatureKeyID,omitempty"` + + // SignatureVerificationDetails contains details or errors from signature verification. + // +optional + SignatureVerificationDetails string `json:"signatureVerificationDetails,omitempty"` + + // AttestedAgentSpiffeID is the SPIFFE ID extracted from the mTLS peer certificate. + // +optional + AttestedAgentSpiffeID string `json:"attestedAgentSpiffeID,omitempty"` +} + // AgentRuntimeStatus defines the observed state of AgentRuntime. 
type AgentRuntimeStatus struct { // Phase is the high-level state of the AgentRuntime @@ -177,6 +212,10 @@ type AgentRuntimeStatus struct { // +optional ConfiguredPods int32 `json:"configuredPods,omitempty"` + // Card holds A2A agent card data discovered from the workload's Service endpoint. + // +optional + Card *CardStatus `json:"card,omitempty"` + // Conditions represent the current state of the AgentRuntime // +optional Conditions []metav1.Condition `json:"conditions,omitempty"` @@ -188,6 +227,7 @@ type AgentRuntimeStatus struct { // +kubebuilder:printcolumn:name="Type",type="string",JSONPath=".spec.type",description="Workload Type" // +kubebuilder:printcolumn:name="Target",type="string",JSONPath=".spec.targetRef.name",description="Target Workload" // +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase",description="Runtime Phase" +// +kubebuilder:printcolumn:name="CardSynced",type="string",JSONPath=".status.conditions[?(@.type=='CardSynced')].status",description="Card Sync Status",priority=1 // +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp" // AgentRuntime attaches runtime configuration to a backing workload classified as an diff --git a/kagenti-operator/api/v1alpha1/zz_generated.deepcopy.go b/kagenti-operator/api/v1alpha1/zz_generated.deepcopy.go index 1576f880..d5ff4707 100644 --- a/kagenti-operator/api/v1alpha1/zz_generated.deepcopy.go +++ b/kagenti-operator/api/v1alpha1/zz_generated.deepcopy.go @@ -397,6 +397,11 @@ func (in *AgentRuntimeSpec) DeepCopy() *AgentRuntimeSpec { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
func (in *AgentRuntimeStatus) DeepCopyInto(out *AgentRuntimeStatus) { *out = *in + if in.Card != nil { + in, out := &in.Card, &out.Card + *out = new(CardStatus) + (*in).DeepCopyInto(*out) + } if in.Conditions != nil { in, out := &in.Conditions, &out.Conditions *out = make([]v1.Condition, len(*in)) @@ -477,6 +482,31 @@ func (in *BindingStatus) DeepCopy() *BindingStatus { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. +func (in *CardStatus) DeepCopyInto(out *CardStatus) { + *out = *in + in.AgentCardData.DeepCopyInto(&out.AgentCardData) + if in.FetchedAt != nil { + in, out := &in.FetchedAt, &out.FetchedAt + *out = (*in).DeepCopy() + } + if in.ValidSignature != nil { + in, out := &in.ValidSignature, &out.ValidSignature + *out = new(bool) + **out = **in + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new CardStatus. +func (in *CardStatus) DeepCopy() *CardStatus { + if in == nil { + return nil + } + out := new(CardStatus) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
func (in *IdentityBinding) DeepCopyInto(out *IdentityBinding) { *out = *in diff --git a/kagenti-operator/cmd/main.go b/kagenti-operator/cmd/main.go index 6ee3c106..5e1730ea 100644 --- a/kagenti-operator/cmd/main.go +++ b/kagenti-operator/cmd/main.go @@ -101,6 +101,8 @@ func main() { var enforceNetworkPolicies bool var enableMLflow bool + var enableCardDiscovery bool + var enableVerifiedFetch bool var verifiedFetchSpiffeSocket string @@ -142,6 +144,8 @@ func main() { flag.BoolVar(&enableMLflow, "enable-mlflow", false, "Enable MLflow experiment tracking integration") + flag.BoolVar(&enableCardDiscovery, "enable-card-discovery", false, + "Enable automatic agent card discovery from AgentRuntime workloads into status.card") flag.BoolVar(&enableVerifiedFetch, "enable-verified-fetch", false, "Enable mTLS-authenticated fetch of agent cards via SPIFFE identity") flag.StringVar(&verifiedFetchSpiffeSocket, "verified-fetch-spiffe-socket", @@ -419,12 +423,23 @@ func main() { os.Exit(1) } - if err = (&controller.AgentRuntimeReconciler{ - Client: mgr.GetClient(), - APIReader: mgr.GetAPIReader(), - Scheme: mgr.GetScheme(), - Recorder: mgr.GetEventRecorderFor("agentruntime-controller"), - }).SetupWithManager(mgr); err != nil { + artReconciler := &controller.AgentRuntimeReconciler{ + Client: mgr.GetClient(), + APIReader: mgr.GetAPIReader(), + Scheme: mgr.GetScheme(), + Recorder: mgr.GetEventRecorderFor("agentruntime-controller"), + EnableCardDiscovery: enableCardDiscovery, + SpireTrustDomain: spireTrustDomain, + } + if enableCardDiscovery { + artReconciler.AgentFetcher = agentFetcher + artReconciler.SignatureProvider = sigProvider + if authenticatedFetcher != nil { + artReconciler.AuthenticatedFetcher = authenticatedFetcher + } + setupLog.Info("Card discovery enabled for AgentRuntime controller") + } + if err = artReconciler.SetupWithManager(mgr); err != nil { setupLog.Error(err, "unable to create controller", "controller", "AgentRuntime") os.Exit(1) } diff --git 
a/kagenti-operator/config/crd/bases/agent.kagenti.dev_agentruntimes.yaml b/kagenti-operator/config/crd/bases/agent.kagenti.dev_agentruntimes.yaml index 8d5fcb6b..ff811c0b 100644 --- a/kagenti-operator/config/crd/bases/agent.kagenti.dev_agentruntimes.yaml +++ b/kagenti-operator/config/crd/bases/agent.kagenti.dev_agentruntimes.yaml @@ -30,6 +30,11 @@ spec: jsonPath: .status.phase name: Phase type: string + - description: Card Sync Status + jsonPath: .status.conditions[?(@.type=='CardSynced')].status + name: CardSynced + priority: 1 + type: string - jsonPath: .metadata.creationTimestamp name: Age type: date @@ -131,6 +136,13 @@ spec: mtls.mode > "disabled". Setting mtlsMode != disabled implicitly requires SPIRE — the operator auto-enables spire for the workload. + CR-empty vs CR="disabled" are observably different in + `kubectl get agentruntime -o yaml` (the former omits the field, + the latter shows mtlsMode: disabled) but produce the same + effective mode: empty falls through to the namespace ConfigMap, + "disabled" is an explicit override that pins mode off even when + the namespace default is non-disabled. + Note: changing mtlsMode triggers a pod rollout because authbridge cannot hot-reload mTLS config (the byte-peek listener is wired at process start). @@ -198,6 +210,220 @@ spec: status: description: AgentRuntimeStatus defines the observed state of AgentRuntime. properties: + card: + description: Card holds A2A agent card data discovered from the workload's + Service endpoint. + properties: + attestedAgentSpiffeID: + description: AttestedAgentSpiffeID is the SPIFFE ID extracted + from the mTLS peer certificate. + type: string + capabilities: + description: The A2A capability set supported by the agent. + properties: + extensions: + description: A list of protocol extensions supported by the + agent. + items: + description: AgentExtension describes an A2A protocol extension + supported by the agent. 
+ properties: + description: + description: A human-readable description of how this + agent uses the extension. + type: string + params: + additionalProperties: + x-kubernetes-preserve-unknown-fields: true + description: Extension-specific configuration parameters. + type: object + required: + description: |- + If true, the client must understand and comply with the extension's + requirements. + type: boolean + uri: + description: The unique URI identifying the extension. + type: string + type: object + type: array + pushNotifications: + description: |- + Indicates if the agent supports sending push notifications for + asynchronous task updates. + type: boolean + streaming: + description: Indicates if the agent supports streaming responses. + type: boolean + type: object + cardId: + description: CardID is a SHA-256 content hash of the fetched card + data. + type: string + defaultInputModes: + description: |- + The set of interaction modes that the agent supports across all skills, + defined as media types. + items: + type: string + type: array + defaultOutputModes: + description: The media types supported as outputs from this agent. + items: + type: string + type: array + description: + description: |- + A human-readable description of the agent, assisting users and other + agents in understanding its purpose. + type: string + documentationUrl: + description: A URL providing additional documentation about the + agent. + type: string + fetchedAt: + description: FetchedAt is the timestamp of the last successful + card fetch. + format: date-time + type: string + iconUrl: + description: A URL to an icon for the agent. + type: string + name: + description: A human-readable name for the agent. + type: string + protocol: + description: Protocol is the detected agent protocol (e.g., "a2a"). + type: string + provider: + description: The service provider of the agent. + properties: + organization: + description: The name of the agent provider's organization. 
+ type: string + url: + description: A URL for the agent provider's website or relevant + documentation. + type: string + type: object + signatureKeyID: + description: SignatureKeyID is the key ID from the verified JWS + header. + type: string + signatureVerificationDetails: + description: SignatureVerificationDetails contains details or + errors from signature verification. + type: string + signatures: + description: JWS signatures per A2A spec §8.4.2. + items: + description: AgentCardSignature represents a JWS signature on + an AgentCard (A2A spec §8.4.2). + properties: + header: + description: Header contains optional unprotected JWS header + parameters. + properties: + timestamp: + description: Timestamp is when the signature was created + (ISO 8601 string) + type: string + type: object + protected: + description: Protected is the base64url-encoded JWS protected + header (contains alg, kid, x5c). + type: string + signature: + description: Signature is the base64url-encoded JWS signature + value. + type: string + required: + - protected + - signature + type: object + type: array + skills: + description: |- + Skills represent the abilities of an agent. A skill is a focused set of + behaviors that the agent is likely to succeed at. + items: + description: AgentSkill represents a skill offered by the agent. + properties: + description: + description: A detailed description of the skill. + type: string + examples: + description: Example prompts or scenarios that this skill + can handle. + items: + type: string + type: array + id: + description: A unique identifier for the agent's skill. + type: string + inputModes: + description: |- + The set of supported input media types for this skill, overriding the + agent's defaults. + items: + type: string + type: array + name: + description: A human-readable name for the skill. + type: string + outputModes: + description: |- + The set of supported output media types for this skill, overriding the + agent's defaults. 
+ items: + type: string + type: array + parameters: + description: The parameters accepted by this skill. + items: + description: SkillParameter defines a parameter accepted + by a skill. + properties: + default: + description: The default value for this parameter. + type: string + description: + description: A human-readable description of the parameter. + type: string + name: + description: The name of the parameter. + type: string + required: + description: Indicates if this parameter must be provided. + type: boolean + type: + description: The type of the parameter (e.g., "string", + "number", "boolean", "object", "array"). + type: string + type: object + type: array + tags: + description: A set of keywords describing the skill's capabilities. + items: + type: string + type: array + type: object + type: array + supportsAuthenticatedExtendedCard: + description: |- + Indicates if the agent supports providing an extended agent card when + authenticated. + type: boolean + url: + description: The URL of the agent's endpoint. + type: string + validSignature: + description: ValidSignature is the result of JWS signature verification. + type: boolean + version: + description: The version of the agent. 
+ type: string + type: object conditions: description: Conditions represent the current state of the AgentRuntime items: diff --git a/kagenti-operator/config/rbac/role.yaml b/kagenti-operator/config/rbac/role.yaml index 69056fe6..44a8d399 100644 --- a/kagenti-operator/config/rbac/role.yaml +++ b/kagenti-operator/config/rbac/role.yaml @@ -124,20 +124,20 @@ rules: - apiGroups: - mlflow.opendatahub.io resources: - - mlflows + - mlflowexperiments verbs: - get - list + - update - watch - apiGroups: - mlflow.opendatahub.io resources: - - mlflowexperiments + - mlflows verbs: - get - list - watch - - update - apiGroups: - networking.k8s.io resources: diff --git a/kagenti-operator/internal/controller/agentcard_controller.go b/kagenti-operator/internal/controller/agentcard_controller.go index 892304c3..24f392c3 100644 --- a/kagenti-operator/internal/controller/agentcard_controller.go +++ b/kagenti-operator/internal/controller/agentcard_controller.go @@ -171,6 +171,15 @@ func (r *AgentCardReconciler) Reconcile(ctx context.Context, req ctrl.Request) ( return ctrl.Result{}, nil } + if agentCard.CreationTimestamp.After(time.Now().Add(-5 * time.Minute)) { + agentCardLogger.Info("AgentCard is deprecated; card data is now available via AgentRuntime status.card. Migrate to AgentRuntime-based discovery.", + "agentCard", agentCard.Name, "namespace", agentCard.Namespace) + if r.Recorder != nil { + r.Recorder.Event(agentCard, corev1.EventTypeWarning, "Deprecated", + "AgentCard is deprecated; card data is now available via AgentRuntime status.card. 
Migrate to AgentRuntime-based discovery.") + } + } + workload, err := r.getWorkload(ctx, agentCard) if err != nil { agentCardLogger.Error(err, "Failed to get workload", "agentCard", agentCard.Name) diff --git a/kagenti-operator/internal/controller/agentcard_controller_test.go b/kagenti-operator/internal/controller/agentcard_controller_test.go index db120deb..52a079c0 100644 --- a/kagenti-operator/internal/controller/agentcard_controller_test.go +++ b/kagenti-operator/internal/controller/agentcard_controller_test.go @@ -20,6 +20,7 @@ import ( "context" "encoding/json" "errors" + "strings" "sync" "time" @@ -29,6 +30,7 @@ import ( corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/types" + "k8s.io/client-go/tools/record" "sigs.k8s.io/controller-runtime/pkg/event" "sigs.k8s.io/controller-runtime/pkg/reconcile" @@ -1722,6 +1724,113 @@ var _ = Describe("AgentCard Controller - getSyncPeriod", func() { Expect(result.RequeueAfter).To(Equal(10 * time.Second)) }) }) + + Context("Deprecation warning for new AgentCards", func() { + const ( + deprecationNS = "default" + deprecationDeploy = "deprecation-deploy" + deprecationCard = "deprecation-card" + ) + + ctx := context.Background() + + It("should emit deprecation event for recently created AgentCard", func() { + replicas := int32(1) + dep := &appsv1.Deployment{ + ObjectMeta: metav1.ObjectMeta{ + Name: deprecationDeploy, + Namespace: deprecationNS, + Labels: map[string]string{ + LabelAgentType: LabelValueAgent, + ProtocolLabelPrefix + "a2a": "", + }, + }, + Spec: appsv1.DeploymentSpec{ + Replicas: &replicas, + Selector: &metav1.LabelSelector{ + MatchLabels: map[string]string{"app": deprecationDeploy}, + }, + Template: corev1.PodTemplateSpec{ + ObjectMeta: metav1.ObjectMeta{ + Labels: map[string]string{ + "app": deprecationDeploy, + LabelAgentType: LabelValueAgent, + ProtocolLabelPrefix + "a2a": "", + }, + }, + Spec: corev1.PodSpec{ + Containers: []corev1.Container{{Name: "agent", 
Image: "test:latest"}}, + }, + }, + }, + } + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + Eventually(func() error { + if err := k8sClient.Get(ctx, types.NamespacedName{Name: deprecationDeploy, Namespace: deprecationNS}, dep); err != nil { + return err + } + dep.Status.Conditions = []appsv1.DeploymentCondition{ + {Type: appsv1.DeploymentAvailable, Status: corev1.ConditionTrue}, + } + return k8sClient.Status().Update(ctx, dep) + }).Should(Succeed()) + + svc := &corev1.Service{ + ObjectMeta: metav1.ObjectMeta{Name: deprecationDeploy, Namespace: deprecationNS}, + Spec: corev1.ServiceSpec{ + Ports: []corev1.ServicePort{{Name: "http", Port: 8000, Protocol: corev1.ProtocolTCP}}, + Selector: map[string]string{"app": deprecationDeploy}, + }, + } + Expect(k8sClient.Create(ctx, svc)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, svc) }() + + ac := &agentv1alpha1.AgentCard{ + ObjectMeta: metav1.ObjectMeta{Name: deprecationCard, Namespace: deprecationNS}, + Spec: agentv1alpha1.AgentCardSpec{ + TargetRef: &agentv1alpha1.TargetRef{ + APIVersion: "apps/v1", + Kind: "Deployment", + Name: deprecationDeploy, + }, + }, + } + Expect(k8sClient.Create(ctx, ac)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, ac) }() + + fetcher := &mockFetcher{ + cardData: &agentv1alpha1.AgentCardData{ + Name: "deprecation-agent", + Version: "1.0.0", + URL: "http://deprecation-deploy.default.svc.cluster.local:8000", + }, + } + fakeRecorder := record.NewFakeRecorder(10) + reconciler := &AgentCardReconciler{ + Client: k8sClient, + Scheme: k8sClient.Scheme(), + AgentFetcher: fetcher, + Recorder: fakeRecorder, + } + + nn := types.NamespacedName{Name: deprecationCard, Namespace: deprecationNS} + _, _ = reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: nn}) + _, err := reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: nn}) + Expect(err).NotTo(HaveOccurred()) + + var foundDeprecation bool + for 
len(fakeRecorder.Events) > 0 { + evt := <-fakeRecorder.Events + if strings.Contains(evt, "Deprecated") { + foundDeprecation = true + break + } + } + Expect(foundDeprecation).To(BeTrue(), "expected a Deprecated event to be emitted") + }) + }) }) // Helper function to find a condition by type diff --git a/kagenti-operator/internal/controller/agentruntime_controller.go b/kagenti-operator/internal/controller/agentruntime_controller.go index 1434824d..e620d1b3 100644 --- a/kagenti-operator/internal/controller/agentruntime_controller.go +++ b/kagenti-operator/internal/controller/agentruntime_controller.go @@ -18,7 +18,11 @@ package controller import ( "context" + "crypto/sha256" + "encoding/hex" + "encoding/json" "fmt" + "strconv" "strings" "time" @@ -43,6 +47,8 @@ import ( "sigs.k8s.io/controller-runtime/pkg/reconcile" agentv1alpha1 "github.com/kagenti/operator/api/v1alpha1" + "github.com/kagenti/operator/internal/agentcard" + "github.com/kagenti/operator/internal/signature" ) const ( @@ -60,6 +66,11 @@ const ( ConditionTypeReady = "Ready" ConditionTypeTargetResolved = "TargetResolved" ConditionTypeConfigResolved = "ConfigResolved" + ConditionTypeCardSynced = "CardSynced" + + // AnnotationLastCardFetchHash stores the change-detection key used to skip + // redundant card fetches when the workload's pod template has not changed. + AnnotationLastCardFetchHash = "agent.kagenti.dev/last-card-fetch-hash" // KindSandbox is the workload kind for agent-sandbox CRs. 
KindSandbox = "Sandbox" @@ -80,6 +91,12 @@ type AgentRuntimeReconciler struct { Scheme *runtime.Scheme Recorder record.EventRecorder APIReader client.Reader // uncached reader for cross-namespace ConfigMap reads + + AgentFetcher agentcard.Fetcher + AuthenticatedFetcher agentcard.AuthenticatedFetcher + SignatureProvider signature.Provider + EnableCardDiscovery bool + SpireTrustDomain string } // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes,verbs=get;list;watch;create;update;patch;delete @@ -91,6 +108,7 @@ type AgentRuntimeReconciler struct { // +kubebuilder:rbac:groups=agents.x-k8s.io,resources=sandboxes/scale,verbs=get;update;patch // +kubebuilder:rbac:groups=core,resources=configmaps,verbs=get;list;watch;create;update;patch // +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch +// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch // +kubebuilder:rbac:groups=core,resources=events,verbs=create;patch func (r *AgentRuntimeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { @@ -178,6 +196,9 @@ func (r *AgentRuntimeReconciler) Reconcile(ctx context.Context, req ctrl.Request "Configuration resolved successfully") } + // 5.5. Card discovery phase: fetch agent card from Service endpoint + r.fetchAndUpdateCard(ctx, rt) + // 6. Apply labels and annotations to the target workload if err := r.applyWorkloadConfig(ctx, rt, configResult.Hash); err != nil { logger.Error(err, "Failed to apply workload config") @@ -423,6 +444,81 @@ func (r *AgentRuntimeReconciler) countConfiguredPods(ctx context.Context, rt *ag return count, nil } +// resolveServiceForWorkload finds the Service that fronts the target workload. +// It first tries a Service with the same name as the Deployment (standard convention), +// then falls back to selector matching against the Deployment's pod template labels. 
+func (r *AgentRuntimeReconciler) resolveServiceForWorkload(ctx context.Context, namespace string, ref agentv1alpha1.TargetRef) (*corev1.Service, int32, error) { + logger := log.FromContext(ctx) + + svc := &corev1.Service{} + if err := r.Get(ctx, types.NamespacedName{Name: ref.Name, Namespace: namespace}, svc); err == nil { + port := serviceHTTPPort(svc) + logger.V(1).Info("Resolved service by name", "service", ref.Name, "port", port) + return svc, port, nil + } + + acc, ok := newRuntimePodTemplateAccessor(ref.Kind) + if !ok { + return nil, 0, fmt.Errorf("unsupported workload kind for service resolution: %s", ref.Kind) + } + if err := r.Get(ctx, types.NamespacedName{Name: ref.Name, Namespace: namespace}, acc.obj); err != nil { + return nil, 0, fmt.Errorf("failed to get workload %s/%s: %w", ref.Kind, ref.Name, err) + } + podLabels := acc.getPodLabels(acc.obj) + if len(podLabels) == 0 { + return nil, 0, fmt.Errorf("workload %s/%s has no pod template labels for selector matching", ref.Kind, ref.Name) + } + + svcList := &corev1.ServiceList{} + if err := r.List(ctx, svcList, client.InNamespace(namespace)); err != nil { + return nil, 0, fmt.Errorf("failed to list services: %w", err) + } + + for i := range svcList.Items { + s := &svcList.Items[i] + if s.Spec.Selector == nil { + continue + } + if selectorMatchesLabels(s.Spec.Selector, podLabels) { + port := serviceHTTPPort(s) + logger.V(1).Info("Resolved service by selector match", "service", s.Name, "port", port) + return s, port, nil + } + } + + return nil, 0, fmt.Errorf("no Service matches workload %s/%s in namespace %s", ref.Kind, ref.Name, namespace) +} + +func selectorMatchesLabels(selector, labels map[string]string) bool { + for k, v := range selector { + if labels[k] != v { + return false + } + } + return true +} + +func serviceHTTPPort(svc *corev1.Service) int32 { + for _, p := range svc.Spec.Ports { + if strings.EqualFold(p.Name, "http") || p.Port == 80 || p.Port == 8080 || p.Port == 8000 { + return p.Port + } + 
} + if len(svc.Spec.Ports) > 0 { + return svc.Spec.Ports[0].Port + } + return 8000 +} + +func getAgentTLSPort(svc *corev1.Service) int32 { + for _, p := range svc.Spec.Ports { + if p.Name == AgentTLSPortName { + return p.Port + } + } + return 0 +} + // isPodOwnedByWorkload checks if a pod is transitively owned by the named workload. // For Deployments: Pod → ReplicaSet (-) → Deployment. // For StatefulSets: Pod is directly owned by the StatefulSet. @@ -533,6 +629,181 @@ func (r *AgentRuntimeReconciler) setCondition(rt *agentv1alpha1.AgentRuntime, co }) } +// fetchAndUpdateCard discovers the agent card from the workload's Service endpoint +// and populates status.card. Skips fetch when the feature flag is disabled or +// when the workload's change-detection key has not changed. +func (r *AgentRuntimeReconciler) fetchAndUpdateCard(ctx context.Context, rt *agentv1alpha1.AgentRuntime) { + logger := log.FromContext(ctx) + + if !r.EnableCardDiscovery { + if rt.Status.Card != nil { + rt.Status.Card = nil + r.setCondition(rt, ConditionTypeCardSynced, metav1.ConditionFalse, "CardDiscoveryDisabled", + "Card discovery is disabled; stale card data cleared") + } + return + } + + changeKey := r.workloadChangeKey(ctx, rt) + annotations := rt.GetAnnotations() + lastHash := "" + if annotations != nil { + lastHash = annotations[AnnotationLastCardFetchHash] + } + if changeKey != "" && changeKey == lastHash && rt.Status.Card != nil { + r.setCondition(rt, ConditionTypeCardSynced, metav1.ConditionTrue, "FetchSkipped", + "Pod template unchanged; existing card data still valid") + return + } + + svc, port, err := r.resolveServiceForWorkload(ctx, rt.Namespace, rt.Spec.TargetRef) + if err != nil { + logger.V(1).Info("Service resolution failed for card discovery", "error", err) + r.setCondition(rt, ConditionTypeCardSynced, metav1.ConditionFalse, "ServiceNotFound", err.Error()) + return + } + + protocol := agentcard.A2AProtocol + cardData, fetchResult, err := r.fetchCard(ctx, rt, svc, port, 
protocol) + if err != nil { + logger.Error(err, "Card fetch failed", "workload", rt.Spec.TargetRef.Name) + r.setCondition(rt, ConditionTypeCardSynced, metav1.ConditionFalse, "CardFetchFailed", err.Error()) + return + } + + newCardID := computeCardContentHash(cardData) + + cardStatus := &agentv1alpha1.CardStatus{ + AgentCardData: *cardData, + CardID: newCardID, + Protocol: protocol, + } + + if rt.Status.Card != nil && rt.Status.Card.CardID == newCardID { + cardStatus.FetchedAt = rt.Status.Card.FetchedAt + } else { + now := metav1.Now() + cardStatus.FetchedAt = &now + } + + if fetchResult != nil && fetchResult.AgentSpiffeID != "" { + cardStatus.AttestedAgentSpiffeID = fetchResult.AgentSpiffeID + } + + if r.SignatureProvider != nil && len(cardData.Signatures) > 0 { + vr, verifyErr := r.SignatureProvider.VerifySignature(ctx, cardData, cardData.Signatures) + if verifyErr != nil { + logger.Error(verifyErr, "Signature verification infrastructure error") + cardStatus.SignatureVerificationDetails = verifyErr.Error() + } else if vr != nil { + cardStatus.ValidSignature = &vr.Verified + cardStatus.SignatureKeyID = vr.KeyID + cardStatus.SignatureVerificationDetails = vr.Details + } + } + + rt.Status.Card = cardStatus + r.setCondition(rt, ConditionTypeCardSynced, metav1.ConditionTrue, "CardSynced", + fmt.Sprintf("Successfully fetched agent card for %s", cardData.Name)) + + r.persistCardFetchAnnotation(ctx, rt, changeKey) +} + +// persistCardFetchAnnotation writes the change-detection annotation to the +// AgentRuntime's metadata via a patch. Status().Update only persists the status +// subresource, so annotations must be written separately. +// +// Patch refreshes rt from the API server response, which overwrites any +// in-memory status mutations (conditions, card data) that have not yet been +// persisted via Status().Update. We save and restore the status to prevent this. 
+func (r *AgentRuntimeReconciler) persistCardFetchAnnotation(ctx context.Context, rt *agentv1alpha1.AgentRuntime, changeKey string) { + logger := log.FromContext(ctx) + + savedStatus := rt.Status.DeepCopy() + + patch := client.MergeFrom(rt.DeepCopy()) + annotations := rt.GetAnnotations() + if annotations == nil { + annotations = make(map[string]string) + } + annotations[AnnotationLastCardFetchHash] = changeKey + rt.SetAnnotations(annotations) + if err := r.Patch(ctx, rt, patch); err != nil { + logger.Error(err, "Failed to persist card fetch annotation") + } + + rt.Status = *savedStatus +} + +// fetchCard retrieves the agent card, choosing mTLS or plain HTTP based on +// service port availability and fetcher configuration. +func (r *AgentRuntimeReconciler) fetchCard( + ctx context.Context, rt *agentv1alpha1.AgentRuntime, + svc *corev1.Service, port int32, protocol string, +) (*agentv1alpha1.AgentCardData, *agentcard.FetchResult, error) { + logger := log.FromContext(ctx) + ref := rt.Spec.TargetRef + + if r.AuthenticatedFetcher != nil { + tlsPort := getAgentTLSPort(svc) + if tlsPort > 0 { + secureURL := agentcard.GetSecureServiceURL(svc.Name, rt.Namespace, tlsPort) + fetchResult, err := r.AuthenticatedFetcher.FetchAuthenticated(ctx, protocol, secureURL) + if err != nil { + return nil, nil, fmt.Errorf("authenticated fetch failed for %s: %w", ref.Name, err) + } + if fetchResult.CardData == nil { + return nil, nil, fmt.Errorf("authenticated fetch returned nil card data for %s", ref.Name) + } + return fetchResult.CardData, fetchResult, nil + } + logger.Info("TLS port not found, falling back to HTTP fetch", + "service", svc.Name, "expectedPortName", AgentTLSPortName) + if r.Recorder != nil { + r.Recorder.Event(rt, corev1.EventTypeWarning, "FallbackToHTTP", + fmt.Sprintf("Service %s has no %s port; fetch is unverified", svc.Name, AgentTLSPortName)) + } + } + + if r.AgentFetcher == nil { + return nil, nil, fmt.Errorf("no fetcher configured for card discovery") + } + + 
serviceURL := agentcard.GetServiceURL(svc.Name, rt.Namespace, port) + cardData, err := r.AgentFetcher.Fetch(ctx, protocol, serviceURL, ref.Name, rt.Namespace) + if err != nil { + return nil, nil, fmt.Errorf("fetch failed for %s: %w", ref.Name, err) + } + return cardData, nil, nil +} + +// workloadChangeKey returns a string that changes when the workload's spec +// changes. For every supported kind (Deployment, StatefulSet, Sandbox) it is +// metadata.generation, so a pod template change always produces a new key. +func (r *AgentRuntimeReconciler) workloadChangeKey(ctx context.Context, rt *agentv1alpha1.AgentRuntime) string { + ref := rt.Spec.TargetRef + acc, ok := newRuntimePodTemplateAccessor(ref.Kind) + if !ok { + return "" + } + if err := r.Get(ctx, types.NamespacedName{Name: ref.Name, Namespace: rt.Namespace}, acc.obj); err != nil { + return "" + } + return strconv.FormatInt(acc.obj.GetGeneration(), 10) +} + +func computeCardContentHash(cardData *agentv1alpha1.AgentCardData) string { + if cardData == nil { + return "" + } + data, err := json.Marshal(cardData) + if err != nil { + return "" + } + hash := sha256.Sum256(data) + return hex.EncodeToString(hash[:]) +} + // templateConfigMapNames lists the well-known ConfigMaps that authbridge sidecars // require. The Helm chart and backend API create these in agent namespaces; the // operator copies templates from kagenti-system for namespaces created by other diff --git a/kagenti-operator/internal/controller/agentruntime_controller_test.go b/kagenti-operator/internal/controller/agentruntime_controller_test.go index 5f9043c9..357bdc09 100644 --- a/kagenti-operator/internal/controller/agentruntime_controller_test.go +++ b/kagenti-operator/internal/controller/agentruntime_controller_test.go @@ -23,6 +23,7 @@ import ( .
"github.com/onsi/gomega" appsv1 "k8s.io/api/apps/v1" corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/api/meta" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "k8s.io/apimachinery/pkg/types" @@ -34,6 +35,15 @@ import ( agentv1alpha1 "github.com/kagenti/operator/api/v1alpha1" ) +type stubCardFetcher struct { + card *agentv1alpha1.AgentCardData + err error +} + +func (f *stubCardFetcher) Fetch(_ context.Context, _, _, _, _ string) (*agentv1alpha1.AgentCardData, error) { + return f.card, f.err +} + var _ = Describe("AgentRuntime Controller", func() { const ( rtName = "test-agentruntime" @@ -639,6 +649,327 @@ var _ = Describe("AgentRuntime Controller", func() { }) }) + Context("Service resolution for card discovery", func() { + It("should resolve service by name match", func() { + dep := newDeployment("svc-name-deploy", namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + svc := &corev1.Service{ + ObjectMeta: metav1.ObjectMeta{ + Name: "svc-name-deploy", + Namespace: namespace, + }, + Spec: corev1.ServiceSpec{ + Ports: []corev1.ServicePort{ + {Name: "http", Port: 8000, Protocol: corev1.ProtocolTCP}, + }, + Selector: map[string]string{"app": "svc-name-deploy"}, + }, + } + Expect(k8sClient.Create(ctx, svc)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, svc) }() + + r := newReconciler() + ref := agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "svc-name-deploy"} + resolvedSvc, port, err := r.resolveServiceForWorkload(ctx, namespace, ref) + Expect(err).NotTo(HaveOccurred()) + Expect(resolvedSvc.Name).To(Equal("svc-name-deploy")) + Expect(port).To(Equal(int32(8000))) + }) + + It("should resolve service by selector match when name does not match", func() { + dep := newDeployment("selector-deploy", namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + svc := 
&corev1.Service{ + ObjectMeta: metav1.ObjectMeta{ + Name: "different-svc-name", + Namespace: namespace, + }, + Spec: corev1.ServiceSpec{ + Ports: []corev1.ServicePort{ + {Name: "http", Port: 9090, Protocol: corev1.ProtocolTCP}, + }, + Selector: map[string]string{"app": "selector-deploy"}, + }, + } + Expect(k8sClient.Create(ctx, svc)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, svc) }() + + r := newReconciler() + ref := agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "selector-deploy"} + resolvedSvc, port, err := r.resolveServiceForWorkload(ctx, namespace, ref) + Expect(err).NotTo(HaveOccurred()) + Expect(resolvedSvc.Name).To(Equal("different-svc-name")) + Expect(port).To(Equal(int32(9090))) + }) + + It("should return error when no matching service exists", func() { + dep := newDeployment("no-svc-deploy", namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + r := newReconciler() + ref := agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "no-svc-deploy"} + _, _, err := r.resolveServiceForWorkload(ctx, namespace, ref) + Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring("no Service matches")) + }) + + It("should prefer first HTTP port", func() { + dep := newDeployment("multi-port-deploy", namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + svc := &corev1.Service{ + ObjectMeta: metav1.ObjectMeta{ + Name: "multi-port-deploy", + Namespace: namespace, + }, + Spec: corev1.ServiceSpec{ + Ports: []corev1.ServicePort{ + {Name: "grpc", Port: 50051, Protocol: corev1.ProtocolTCP}, + {Name: "http", Port: 8080, Protocol: corev1.ProtocolTCP}, + }, + Selector: map[string]string{"app": "multi-port-deploy"}, + }, + } + Expect(k8sClient.Create(ctx, svc)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, svc) }() + + r := newReconciler() + ref := 
agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "multi-port-deploy"} + _, port, err := r.resolveServiceForWorkload(ctx, namespace, ref) + Expect(err).NotTo(HaveOccurred()) + Expect(port).To(Equal(int32(8080))) + }) + }) + + Context("Card fetch phase", func() { + It("should skip card fetch when feature flag is disabled", func() { + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "no-card-rt", Namespace: namespace}, + Status: agentv1alpha1.AgentRuntimeStatus{}, + } + + r := &AgentRuntimeReconciler{ + Client: k8sClient, + EnableCardDiscovery: false, + } + r.fetchAndUpdateCard(ctx, rt) + Expect(rt.Status.Card).To(BeNil()) + + var cardCond *metav1.Condition + for i := range rt.Status.Conditions { + if rt.Status.Conditions[i].Type == ConditionTypeCardSynced { + cardCond = &rt.Status.Conditions[i] + break + } + } + Expect(cardCond).To(BeNil(), "No CardSynced condition should be set when card was already nil") + }) + + It("should clear existing card data when feature flag is disabled", func() { + now := metav1.Now() + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "clear-card-rt", Namespace: namespace}, + Status: agentv1alpha1.AgentRuntimeStatus{ + Card: &agentv1alpha1.CardStatus{ + AgentCardData: agentv1alpha1.AgentCardData{Name: "old-agent"}, + FetchedAt: &now, + }, + }, + } + + r := &AgentRuntimeReconciler{ + Client: k8sClient, + EnableCardDiscovery: false, + } + r.fetchAndUpdateCard(ctx, rt) + Expect(rt.Status.Card).To(BeNil()) + + var cardCond *metav1.Condition + for i := range rt.Status.Conditions { + if rt.Status.Conditions[i].Type == ConditionTypeCardSynced { + cardCond = &rt.Status.Conditions[i] + break + } + } + Expect(cardCond).NotTo(BeNil()) + Expect(cardCond.Status).To(Equal(metav1.ConditionFalse)) + Expect(cardCond.Reason).To(Equal("CardDiscoveryDisabled")) + }) + + It("should set ServiceNotFound condition when no service exists", func() { + dep := newDeployment("card-no-svc-deploy", 
namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "card-no-svc-rt", Namespace: namespace}, + Spec: agentv1alpha1.AgentRuntimeSpec{ + Type: agentv1alpha1.RuntimeTypeAgent, + TargetRef: agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "card-no-svc-deploy"}, + }, + } + + r := &AgentRuntimeReconciler{ + Client: k8sClient, + EnableCardDiscovery: true, + } + r.fetchAndUpdateCard(ctx, rt) + Expect(rt.Status.Card).To(BeNil()) + + var cardCond *metav1.Condition + for i := range rt.Status.Conditions { + if rt.Status.Conditions[i].Type == ConditionTypeCardSynced { + cardCond = &rt.Status.Conditions[i] + break + } + } + Expect(cardCond).NotTo(BeNil()) + Expect(cardCond.Status).To(Equal(metav1.ConditionFalse)) + Expect(cardCond.Reason).To(Equal("ServiceNotFound")) + }) + }) + + Context("Card data retention on fetch failure (FR-013)", func() { + It("should retain existing card data when fetch fails", func() { + now := metav1.Now() + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "retain-card-rt", Namespace: namespace}, + Spec: agentv1alpha1.AgentRuntimeSpec{ + Type: agentv1alpha1.RuntimeTypeAgent, + TargetRef: agentv1alpha1.TargetRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "nonexistent-for-retain"}, + }, + Status: agentv1alpha1.AgentRuntimeStatus{ + Card: &agentv1alpha1.CardStatus{ + AgentCardData: agentv1alpha1.AgentCardData{Name: "previous-agent", Version: "1.0"}, + FetchedAt: &now, + CardID: "abc123", + }, + }, + } + + r := &AgentRuntimeReconciler{ + Client: k8sClient, + EnableCardDiscovery: true, + } + r.fetchAndUpdateCard(ctx, rt) + + Expect(rt.Status.Card).NotTo(BeNil(), "existing card data should be retained on fetch failure") + Expect(rt.Status.Card.Name).To(Equal("previous-agent")) + Expect(rt.Status.Card.CardID).To(Equal("abc123")) + + var cardCond *metav1.Condition + for 
i := range rt.Status.Conditions { + if rt.Status.Conditions[i].Type == ConditionTypeCardSynced { + cardCond = &rt.Status.Conditions[i] + break + } + } + Expect(cardCond).NotTo(BeNil()) + Expect(cardCond.Status).To(Equal(metav1.ConditionFalse)) + }) + }) + + Context("Feature flag toggle behavior", func() { + It("should not fetch card when flag is disabled and card is nil", func() { + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "toggle-off-nil-rt", Namespace: namespace}, + Status: agentv1alpha1.AgentRuntimeStatus{}, + } + r := &AgentRuntimeReconciler{Client: k8sClient, EnableCardDiscovery: false} + r.fetchAndUpdateCard(ctx, rt) + Expect(rt.Status.Card).To(BeNil()) + // No CardSynced condition should be set when card was already nil + }) + + It("should clear populated card data when flag is toggled off", func() { + now := metav1.Now() + rt := &agentv1alpha1.AgentRuntime{ + ObjectMeta: metav1.ObjectMeta{Name: "toggle-off-populated-rt", Namespace: namespace}, + Status: agentv1alpha1.AgentRuntimeStatus{ + Card: &agentv1alpha1.CardStatus{ + AgentCardData: agentv1alpha1.AgentCardData{Name: "stale-agent"}, + FetchedAt: &now, + }, + }, + } + r := &AgentRuntimeReconciler{Client: k8sClient, EnableCardDiscovery: false} + r.fetchAndUpdateCard(ctx, rt) + Expect(rt.Status.Card).To(BeNil()) + }) + }) + + Context("Card annotation patch must not wipe in-memory status", func() { + It("should persist CardSynced condition and card data after annotation patch", func() { + depName := "card-patch-deploy" + svcName := depName + dep := newDeployment(depName, namespace) + Expect(k8sClient.Create(ctx, dep)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, dep) }() + + svc := &corev1.Service{ + ObjectMeta: metav1.ObjectMeta{Name: svcName, Namespace: namespace}, + Spec: corev1.ServiceSpec{ + Selector: map[string]string{"app": depName}, + Ports: []corev1.ServicePort{{Name: "http", Port: 8080, Protocol: corev1.ProtocolTCP}}, + }, + } + 
Expect(k8sClient.Create(ctx, svc)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, svc) }() + + rt := newAgentRuntime("card-patch-rt", namespace, depName, agentv1alpha1.RuntimeTypeAgent) + Expect(k8sClient.Create(ctx, rt)).To(Succeed()) + defer func() { _ = k8sClient.Delete(ctx, rt) }() + + // Pre-set conditions that would be set earlier in the reconcile loop + meta.SetStatusCondition(&rt.Status.Conditions, metav1.Condition{ + Type: ConditionTypeTargetResolved, Status: metav1.ConditionTrue, + Reason: "TargetFound", Message: "resolved", + }) + meta.SetStatusCondition(&rt.Status.Conditions, metav1.Condition{ + Type: ConditionTypeConfigResolved, Status: metav1.ConditionTrue, + Reason: "ConfigResolved", Message: "resolved", + }) + + stubFetcher := &stubCardFetcher{ + card: &agentv1alpha1.AgentCardData{ + Name: "Test Agent", + Version: "2.0", + }, + } + + r := &AgentRuntimeReconciler{ + Client: k8sClient, + EnableCardDiscovery: true, + AgentFetcher: stubFetcher, + } + r.fetchAndUpdateCard(ctx, rt) + + // Card data must survive the annotation patch + Expect(rt.Status.Card).NotTo(BeNil(), "card data must not be wiped by annotation patch") + Expect(rt.Status.Card.Name).To(Equal("Test Agent")) + Expect(rt.Status.Card.Version).To(Equal("2.0")) + Expect(rt.Status.Card.CardID).NotTo(BeEmpty()) + + // CardSynced condition must survive + cardCond := meta.FindStatusCondition(rt.Status.Conditions, ConditionTypeCardSynced) + Expect(cardCond).NotTo(BeNil(), "CardSynced condition must not be wiped by annotation patch") + Expect(cardCond.Status).To(Equal(metav1.ConditionTrue)) + Expect(cardCond.Reason).To(Equal("CardSynced")) + + // Conditions set before fetchAndUpdateCard must also survive + targetCond := meta.FindStatusCondition(rt.Status.Conditions, ConditionTypeTargetResolved) + Expect(targetCond).NotTo(BeNil(), "TargetResolved condition must not be wiped by annotation patch") + configCond := meta.FindStatusCondition(rt.Status.Conditions, ConditionTypeConfigResolved) + 
Expect(configCond).NotTo(BeNil(), "ConfigResolved condition must not be wiped by annotation patch") + }) + }) + Context("Sandbox workload support", func() { It("should create a Sandbox accessor that reads/writes pod template labels and annotations", func() { acc, ok := newRuntimePodTemplateAccessor("Sandbox") diff --git a/specs/001-agentcard-into-status/REVIEWERS.md b/specs/001-agentcard-into-status/REVIEWERS.md new file mode 100644 index 00000000..051db4a9 --- /dev/null +++ b/specs/001-agentcard-into-status/REVIEWERS.md @@ -0,0 +1,66 @@ +# Review Guide: Consolidate AgentCard Data Into AgentRuntime Status + +**Generated**: 2026-05-21 | **Spec**: [spec.md](spec.md) + +## Why This Change + +AgentRuntime and AgentCard are two CRDs with identical cardinality, the same namespace, the same lifecycle, and the same owner. Operators currently need to cross-reference both resources to understand what an agent can do. AgentCard also conflates observation with policy (it's entirely controller-managed with no room for admin-authored policy fields), and its JWS signing pipeline signs a skeleton card with empty skills. This change consolidates the card data into AgentRuntime's status, reducing the API surface and simplifying the operator experience. + +## What Changes + +AgentRuntime gains a new `status.card` field that holds the A2A agent card payload (name, description, skills, capabilities), fetch metadata (timestamp, content hash, protocol), and identity verification results (SPIFFE ID, JWS signature validation). The controller fetches the card from `/.well-known/agent-card.json` on rollout events. mTLS verified fetch is included, reusing the infrastructure from PR #284. The entire feature is gated behind `--enable-card-discovery` (default: disabled). AgentCard CRD remains functional but emits deprecation warnings on new CR creation. + +## How It Works + +A new `fetchAndUpdateCard` phase is added to the existing AgentRuntime reconcile loop, after config hash computation. 
It resolves the agent's Service endpoint by name (matching existing convention) with selector-match fallback, fetches the card via the existing `agentcard.Fetcher` interface (plain HTTP) or `agentcard.AuthenticatedFetcher` (mTLS with SPIFFE), runs JWS signature verification via the existing `signature.Provider`, and writes the results to `status.card`. Re-fetch is triggered only by pod template hash changes, not periodic polling. The controller reuses all existing fetch, parse, and verify code from the AgentCard controller without creating new packages. + +## When It Applies + +**Applies when**: +- The operator is started with `--enable-card-discovery=true` +- An AgentRuntime targets a workload (Deployment, StatefulSet, or Sandbox) whose Pods serve `/.well-known/agent-card.json` +- The backing workload rolls out (pod template hash or generation changes) + +**Does not apply when**: +- `--enable-card-discovery` is not set (default). No card fetch occurs. +- AgentRuntime targets a workload without an A2A card endpoint. The controller sets a `CardSynced=False` condition and moves on. +- mTLS policy fields, AgentCard CRD removal, and migration tooling are deferred to future iterations. + +## Key Decisions + +1. **Extend existing controller (not a new one)**: The card fetch is a single HTTP GET added to the existing reconcile loop. Creating a separate controller would add coordination complexity for minimal isolation benefit. If performance becomes an issue at scale, extraction is a clean refactor. + +2. **Selector match for service resolution**: Resolves the workload's (Deployment, StatefulSet, Sandbox) Pod selector labels to find the matching Service. Falls back from name-based convention (matching AgentCard behavior) to selector matching. No annotations required. + +3. **Retain stale data on fetch failure**: When a fetch fails, the last successful card data is kept in `status.card`. The `CardSynced` condition and `fetchedAt` timestamp signal staleness. 
This avoids disruption for tools that consume `status.card`. + +4. **Clear data when feature flag is disabled**: When `--enable-card-discovery` is toggled off, `status.card` is cleared on the next reconcile. This prevents stale data from lingering when the feature is explicitly turned off. + +5. **mTLS included in this iteration**: Rather than deferring mTLS to a follow-up, the recently merged PR #284 infrastructure is reused directly. This avoids building a plain-HTTP-only path that would be immediately replaced. + +## Areas Needing Attention + +- The `fetchAndUpdateCard` method adds network I/O (HTTP GET) to a controller that currently only does Kubernetes API calls. The existing 10s timeout from `doHTTPFetch` applies, but a slow or unreachable agent endpoint will delay the reconcile loop by up to 10s. +- Pod template hash change detection is the trigger mechanism. If the agent updates its card content without a pod rollout, the controller will not re-fetch. This is an intentional tradeoff (no polling) documented in the spec. +- The `CardStatus` struct embeds `AgentCardData` which includes the `Signatures` field (JWS array). For agents that sign their cards, this means the raw JWS data is stored in AgentRuntime status, which could be large. +- Service resolution assumes at most one Service matches a Deployment's selector. Multiple matching Services will use the first one found, which may not be deterministic. + +## Open Questions + +No open questions identified. All clarifications were resolved during the spec clarification phase. 
+ +## Review Checklist + +- [ ] Key decisions are justified +- [ ] Breaking changes are documented with migration guidance +- [ ] Scope matches the stated boundaries +- [ ] Success criteria are achievable +- [ ] No unstated assumptions +- [ ] `CardStatus` struct fields match the A2A agent card spec +- [ ] Feature flag default (disabled) is correct for backward compatibility +- [ ] Deprecation warning is non-disruptive (log + event, no behavior change) +- [ ] mTLS reuse from PR #284 is compatible with the AgentRuntime controller's lifecycle + +--- + + diff --git a/specs/001-agentcard-into-status/checklists/requirements.md b/specs/001-agentcard-into-status/checklists/requirements.md new file mode 100644 index 00000000..4f8ad699 --- /dev/null +++ b/specs/001-agentcard-into-status/checklists/requirements.md @@ -0,0 +1,36 @@ +# Specification Quality Checklist: Consolidate AgentCard Data Into AgentRuntime Status + +**Purpose**: Validate specification completeness and quality before proceeding to planning +**Created**: 2026-05-21 +**Feature**: [spec.md](../spec.md) + +## Content Quality + +- [x] No implementation details (languages, frameworks, APIs) +- [x] Focused on user value and business needs +- [x] Written for non-technical stakeholders +- [x] All mandatory sections completed + +## Requirement Completeness + +- [x] No [NEEDS CLARIFICATION] markers remain +- [x] Requirements are testable and unambiguous +- [x] Success criteria are measurable +- [x] Success criteria are technology-agnostic (no implementation details) +- [x] All acceptance scenarios are defined +- [x] Edge cases are identified +- [x] Scope is clearly bounded +- [x] Dependencies and assumptions identified + +## Feature Readiness + +- [x] All functional requirements have clear acceptance criteria +- [x] User scenarios cover primary flows +- [x] Feature meets measurable outcomes defined in Success Criteria +- [x] No implementation details leak into specification + +## Notes + +- All items pass. 
Spec is ready for `/speckit-clarify` or `/speckit-plan`. +- Reasonable defaults were chosen for: fetch trigger mechanism (pod template hash change), service endpoint resolution (selector matching), and feature flag default (disabled). +- Reuses the existing `AgentCardData` struct assumption, documented in Assumptions section. diff --git a/specs/001-agentcard-into-status/contracts/agentruntime-status-card.md b/specs/001-agentcard-into-status/contracts/agentruntime-status-card.md new file mode 100644 index 00000000..0e8e787e --- /dev/null +++ b/specs/001-agentcard-into-status/contracts/agentruntime-status-card.md @@ -0,0 +1,77 @@ +# Contract: AgentRuntime status.card + +## Overview + +The `status.card` field on the AgentRuntime CRD exposes discovered A2A agent card data, fetch metadata, and identity verification results. This contract defines the shape and semantics of this new status field. + +## CRD Status Extension + +```yaml +apiVersion: agent.kagenti.dev/v1alpha1 +kind: AgentRuntime +status: + phase: Active + configuredPods: 1 + conditions: + - type: CardSynced + status: "True" + reason: CardSynced + message: "Successfully fetched agent card for my-agent" + card: + # A2A card payload (from AgentCardData) + name: "my-agent" + description: "An example A2A agent" + version: "1.0.0" + url: "http://my-agent.default.svc.cluster.local:8000" + skills: + - id: "summarize" + name: "Summarize" + description: "Summarizes text input" + tags: ["nlp", "text"] + capabilities: + streaming: true + pushNotifications: false + defaultInputModes: ["text/plain"] + defaultOutputModes: ["text/plain"] + + # Fetch metadata + fetchedAt: "2026-05-21T10:30:00Z" + cardId: "a1b2c3d4e5f6..." 
+ protocol: "a2a" + + # Verification fields (populated when mTLS is active) + validSignature: true + signatureKeyID: "key-001" + signatureVerificationDetails: "JWS signature valid (x5c chain verified)" + attestedAgentSpiffeID: "spiffe://example.org/ns/default/sa/my-agent" +``` + +## Condition Contract: CardSynced + +Added to `status.conditions[]` alongside existing conditions (Ready, TargetResolved, ConfigResolved). + +| Reason | Status | Trigger | +|--------|--------|---------| +| `CardSynced` | True | Successful fetch and parse | +| `FetchSkipped` | True | Pod template hash unchanged, existing data valid | +| `CardFetchFailed` | False | HTTP or mTLS connection error | +| `CardParseFailed` | False | Response is not valid A2A JSON | +| `ServiceNotFound` | False | No Service matches the Deployment selector | +| `WorkloadNotReady` | False | Target Deployment has zero ready Pods | +| `CardDiscoveryDisabled` | False | Feature flag is disabled | + +## Feature Flag + +``` +--enable-card-discovery=false # default: disabled +``` + +When disabled: `status.card` is nil, no `CardSynced` condition is set. If previously enabled and data exists, `status.card` is cleared on the next reconcile. + +## Backward Compatibility + +- No changes to `spec` fields on AgentRuntime +- No changes to existing `status` fields (phase, configuredPods, existing conditions) +- AgentCard CRD continues to function independently +- No API version bump required (additive status change only) diff --git a/specs/001-agentcard-into-status/data-model.md b/specs/001-agentcard-into-status/data-model.md new file mode 100644 index 00000000..9440f8ad --- /dev/null +++ b/specs/001-agentcard-into-status/data-model.md @@ -0,0 +1,92 @@ +# Data Model: Consolidate AgentCard Data Into AgentRuntime Status + +## Entity Changes + +### New: CardStatus (AgentRuntime status.card) + +Added to `AgentRuntimeStatus` as an optional field.
+ +**Fields**: + +| Field | Type | Description | Source | +|-------|------|-------------|--------| +| `name` | string | Agent name from A2A card | AgentCardData (embedded) | +| `description` | string | Agent description | AgentCardData (embedded) | +| `version` | string | Agent version | AgentCardData (embedded) | +| `url` | string | Agent endpoint URL | AgentCardData (embedded) | +| `documentationUrl` | string | Documentation URL | AgentCardData (embedded) | +| `iconUrl` | string | Agent icon URL | AgentCardData (embedded) | +| `provider` | AgentProvider | Service provider info | AgentCardData (embedded) | +| `capabilities` | AgentCapabilities | A2A capability set | AgentCardData (embedded) | +| `defaultInputModes` | []string | Supported input media types | AgentCardData (embedded) | +| `defaultOutputModes` | []string | Supported output media types | AgentCardData (embedded) | +| `skills` | []AgentSkill | Agent skills list | AgentCardData (embedded) | +| `supportsAuthenticatedExtendedCard` | *bool | Extended card support | AgentCardData (embedded) | +| `signatures` | []AgentCardSignature | JWS signatures | AgentCardData (embedded) | +| `fetchedAt` | *metav1.Time | Last successful fetch timestamp | New field | +| `cardId` | string | SHA-256 hash of card content | New field | +| `protocol` | string | Detected agent protocol (e.g., "a2a") | New field | +| `validSignature` | *bool | JWS signature validation result | New field | +| `signatureKeyID` | string | Key ID from verified JWS header | New field | +| `signatureVerificationDetails` | string | Verification details/error message | New field | +| `attestedAgentSpiffeID` | string | SPIFFE ID from mTLS peer certificate | New field | + +**Note**: Change-detection hash (`lastPodTemplateHash`) is stored as an annotation (`agent.kagenti.dev/last-card-fetch-hash`) on the AgentRuntime, not in the CRD status. This keeps the implementation mechanism out of the public API surface. 
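The annotation-based change detection described in the note above could look roughly like the following. This is a minimal sketch rather than the controller's actual code: `needsCardFetch` and `recordCardFetch` are hypothetical helper names, and only the annotation key is taken from this document.

```go
package main

import "fmt"

// lastCardFetchHashAnnotation is the annotation key named in the data model note.
const lastCardFetchHashAnnotation = "agent.kagenti.dev/last-card-fetch-hash"

// needsCardFetch reports whether the card should be re-fetched: true when no
// hash has been recorded yet, or when the recorded hash differs from the
// workload's current pod template hash. (Hypothetical helper.)
func needsCardFetch(annotations map[string]string, currentHash string) bool {
	recorded, ok := annotations[lastCardFetchHashAnnotation]
	return !ok || recorded != currentHash
}

// recordCardFetch returns a copy of annotations with the hash of the last
// successful fetch stored; the input map is left untouched. (Hypothetical helper.)
func recordCardFetch(annotations map[string]string, currentHash string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[lastCardFetchHashAnnotation] = currentHash
	return out
}

func main() {
	ann := map[string]string{}
	fmt.Println(needsCardFetch(ann, "6b7c8d9e0f")) // nothing recorded yet: fetch
	ann = recordCardFetch(ann, "6b7c8d9e0f")
	fmt.Println(needsCardFetch(ann, "6b7c8d9e0f")) // same hash: skip
	fmt.Println(needsCardFetch(ann, "0a1b2c3d4e")) // rollout changed the hash: fetch
}
```

In the real controller, persisting the annotation means a Patch on the main resource, which the constitution's Principle III says must not be allowed to discard in-memory status.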
+ +### Modified: AgentRuntimeStatus + +| Field | Change | Before | After | +|-------|--------|--------|-------| +| `card` | Added | (not present) | `*CardStatus` (optional) | + +### Modified: AgentRuntimeReconciler + +| Field | Change | Description | +|-------|--------|-------------| +| `AgentFetcher` | Added | `agentcard.Fetcher` interface (plain HTTP) | +| `AuthenticatedFetcher` | Added | `agentcard.AuthenticatedFetcher` interface (mTLS) | +| `SignatureProvider` | Added | `signature.Provider` interface (JWS verification) | +| `EnableCardDiscovery` | Added | Feature flag (bool) | +| `SpireTrustDomain` | Added | SPIFFE trust domain string | + +### New Condition: CardSynced + +Added to AgentRuntime `status.conditions[]`. + +| Reason | Status | When | +|--------|--------|------| +| `CardSynced` | True | Card fetched and parsed successfully | +| `CardFetchFailed` | False | HTTP/mTLS fetch error | +| `CardParseFailed` | False | JSON parse error | +| `ServiceNotFound` | False | No Service matches the workload's selector | +| `WorkloadNotReady` | False | Workload has zero ready Pods | +| `FetchTimeout` | False | Card fetch exceeded 10-second timeout | +| `CardDiscoveryDisabled` | False | Feature flag is off (clears stale data) | +| `FetchSkipped` | True | Pod template hash unchanged, card data still valid | + +## Relationships + +``` +AgentRuntime (1) ---> (0..1) CardStatus (status.card) + | + +--> AgentCardData (embedded, reused from api/v1alpha1) + +--> Fetch metadata (fetchedAt, cardId, protocol) + +--> Verification fields (validSignature, attestedAgentSpiffeID, ...) 
+ +AgentRuntime.spec.targetRef ---> Deployment/StatefulSet/Sandbox + | + +--> Pod selector labels ---> Service (selector match) + | + +--> /.well-known/agent-card.json +``` + +## State Transitions for status.card + +``` +[Empty] -- feature flag enabled + reconcile --> [Fetching] +[Fetching] -- success --> [Populated] (card data + metadata + verification) +[Fetching] -- failure --> [Empty or Stale] (retain last good data, CardSynced=False) +[Populated] -- pod template hash change --> [Fetching] (re-fetch) +[Populated] -- feature flag disabled --> [Empty] (cleared on next reconcile) +[Populated] -- workload deleted --> [Stale] (AgentRuntime still exists, card data stale) +``` diff --git a/specs/001-agentcard-into-status/plan.md b/specs/001-agentcard-into-status/plan.md new file mode 100644 index 00000000..b41a316b --- /dev/null +++ b/specs/001-agentcard-into-status/plan.md @@ -0,0 +1,74 @@ +# Implementation Plan: Consolidate AgentCard Data Into AgentRuntime Status + +**Branch**: `001-agentcard-into-status` | **Date**: 2026-05-21 | **Spec**: [spec.md](spec.md) +**Input**: Feature specification from `specs/001-agentcard-into-status/spec.md` + +## Summary + +Move A2A agent card discovery into the AgentRuntime controller's reconcile loop so operators can read card data, fetch metadata, and mTLS verification results from a single resource (`status.card` on AgentRuntime). Reuses the existing `agentcard.Fetcher` and `agentcard.AuthenticatedFetcher` interfaces and the `AgentCardData` struct. The feature is gated behind a `--enable-card-discovery` flag (default: disabled). AgentCard CRD remains functional with a deprecation warning. Identity binding policy and enforcement actions stay on AgentCard during coexistence (see Out of Scope in spec.md). 
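As an illustration of the flag gating mentioned in the summary, a default-off boolean flag can be declared with Go's standard `flag` package. This is only a sketch: the `options` struct and `parseOptions` function are hypothetical, and the operator's real `cmd/main.go` wiring may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the operator settings relevant to card discovery.
// (Hypothetical struct for illustration.)
type options struct {
	enableCardDiscovery bool
}

// parseOptions parses command-line arguments; card discovery stays off
// unless the flag is passed explicitly.
func parseOptions(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.BoolVar(&opts.enableCardDiscovery, "enable-card-discovery", false,
		"Fetch agent cards into AgentRuntime status.card")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return opts, nil
}

func main() {
	for _, args := range [][]string{nil, {"--enable-card-discovery=true"}} {
		opts, err := parseOptions(args)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("args=%v enableCardDiscovery=%t\n", args, opts.enableCardDiscovery)
	}
}
```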
+ +## Technical Context + +**Language/Version**: Go 1.25, controller-runtime v0.23.3 +**Primary Dependencies**: controller-runtime, go-spiffe/v2, k8s.io/apimachinery +**Storage**: Kubernetes CRD status subresource (no external storage) +**Testing**: Ginkgo/Gomega (unit + integration), envtest for controller tests, e2e in `test/e2e/` +**Target Platform**: Kubernetes 1.31+ +**Project Type**: Kubernetes operator (kubebuilder-based) +**Performance Goals**: Card fetch adds < 1s to reconcile; 10s timeout on HTTP/mTLS request; no periodic re-fetch (event-driven only) +**Constraints**: No new CRDs; feature-gated; backward compatible with existing AgentCard workflows; change-detection hash stored as annotation (not in CRD status API surface) +**Scale/Scope**: Hundreds of AgentRuntimes per cluster (card fetch is 1:1 with AgentRuntime) + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +Constitution is a template (not customized for this project). No gates to evaluate. +Project follows standard kubebuilder patterns: CRD types in `api/v1alpha1/`, controllers in `internal/controller/`, shared logic in `internal/` packages. 
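The 10s request timeout listed under Performance Goals, together with the 1 MiB response cap from the spec's edge cases, might be enforced roughly as follows. This is a sketch under stated assumptions: `fetchCard` and `fetchFromStub` are hypothetical names, the project's existing `doHTTPFetch` may be structured differently, and a local test server stands in for an agent's Service.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

const (
	fetchTimeout = 10 * time.Second // request timeout from Performance Goals
	maxCardBytes = 1 << 20          // 1 MiB response cap from the spec's edge cases
)

var errCardTooLarge = errors.New("agent card response exceeds size limit")

// fetchCard GETs url and returns the body. It fails on timeout, on a non-200
// status, or when the body is larger than limit bytes. (Hypothetical helper.)
func fetchCard(ctx context.Context, client *http.Client, url string, timeout time.Duration, limit int64) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	// Read one byte past the limit so an oversized body is detected
	// instead of being silently truncated.
	body, err := io.ReadAll(io.LimitReader(resp.Body, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > limit {
		return nil, errCardTooLarge
	}
	return body, nil
}

// fetchFromStub serves body from a local test server and fetches it back,
// so the sketch runs without a cluster.
func fetchFromStub(body string, limit int64) ([]byte, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, body)
	}))
	defer srv.Close()
	return fetchCard(context.Background(), srv.Client(),
		srv.URL+"/.well-known/agent-card.json", fetchTimeout, limit)
}

func main() {
	body, err := fetchFromStub(`{"name":"my-agent"}`, maxCardBytes)
	fmt.Println(string(body), err)

	_, err = fetchFromStub(`{"name":"my-agent"}`, 8) // artificially small cap
	fmt.Println(err)
}
```

Reading `limit+1` bytes is a deliberate choice here: it lets the caller distinguish a body of exactly `limit` bytes from one that was cut off.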
+ +## Project Structure + +### Documentation (this feature) + +```text +specs/001-agentcard-into-status/ +├── plan.md # This file +├── research.md # Phase 0 output +├── data-model.md # Phase 1 output +├── contracts/ # Phase 1 output (CRD status contract) +├── REVIEWERS.md # Review guide for PR +└── tasks.md # Phase 2 output (/speckit.tasks) +``` + +### Source Code (repository root) + +```text +kagenti-operator/ +├── api/v1alpha1/ +│ ├── agentruntime_types.go # MODIFY: add CardStatus to AgentRuntimeStatus +│ ├── agentcard_types.go # READ-ONLY: reuse AgentCardData struct +│ └── zz_generated.deepcopy.go # REGENERATE: after type changes +├── internal/ +│ ├── agentcard/ +│ │ └── fetcher.go # READ-ONLY: reuse Fetcher, AuthenticatedFetcher, SpiffeFetcher +│ ├── controller/ +│ │ ├── agentruntime_controller.go # MODIFY: add card fetch phase to reconcile +│ │ ├── agentruntime_controller_test.go # MODIFY: add card fetch tests +│ │ ├── agentcard_controller.go # MODIFY: add deprecation warning log +│ │ └── agentcard_controller_test.go # MODIFY: test deprecation warning +│ └── signature/ # READ-ONLY: reuse VerificationResult, Provider +├── cmd/ +│ └── main.go # MODIFY: add --enable-card-discovery flag, wire fetchers +├── config/ +│ ├── crd/bases/ # REGENERATE: CRD manifests after type changes +│ └── rbac/ # MODIFY: add Service list/watch RBAC for agentruntime controller +└── test/ + ├── e2e/ # MODIFY: add card discovery e2e scenarios + └── integration/ # MODIFY: add card fetch integration tests +``` + +**Structure Decision**: Existing kubebuilder project structure. All changes extend existing files. No new packages or directories needed. Workload resolution covers Deployment, StatefulSet, and Sandbox (matching existing AgentCard controller patterns). + +## Complexity Tracking + +No constitution violations. No complexity justification needed. 
diff --git a/specs/001-agentcard-into-status/research.md b/specs/001-agentcard-into-status/research.md new file mode 100644 index 00000000..cfde7133 --- /dev/null +++ b/specs/001-agentcard-into-status/research.md @@ -0,0 +1,72 @@ +# Research: Consolidate AgentCard Data Into AgentRuntime Status + +## R1: Service Endpoint Resolution Strategy + +**Decision**: Selector matching. Resolve the workload's Pod selector labels, list Services in the same namespace, find the first Service whose selector matches. Applies to all supported workload types (Deployment, StatefulSet, Sandbox). + +**Rationale**: Standard Kubernetes pattern. Works automatically without user annotations. The existing AgentCard controller uses a simpler convention (Service name = workload name via `workload.ServiceName`), but selector matching is more robust for cases where Service and workload names diverge. + +**Alternatives considered**: +- Naming convention (Service name = workload name): simpler but brittle when names differ +- Annotation-driven: more flexible but adds configuration burden +- Hybrid (selector match + annotation override): future enhancement if needed + +**Implementation note**: The existing `AgentCardReconciler.getWorkload()` sets `ServiceName: targetRef.Name` (line 526 of agentcard_controller.go). For the AgentRuntime controller, we should match this convention initially (use the workload name as the Service name) since it aligns with how Services are typically created for agent workloads. If no Service matches by name, fall back to selector matching. + +## R2: Card Fetch Trigger Mechanism + +**Decision**: Pod template hash change detection. The AgentRuntime controller already watches workloads and reconciles on changes. 
The card fetch phase checks whether the workload's pod-template-hash (or generation for StatefulSets/Sandboxes) has changed since the last successful fetch by comparing against a hash stored in an annotation (`agent.kagenti.dev/last-card-fetch-hash`), not in the CRD status API surface. + +**Rationale**: Avoids unnecessary HTTP calls on every reconcile. Pod template hash changes correlate with actual code/config changes that could affect the agent card. The AgentRuntime controller already reconciles on Deployment changes, so no new watches needed. + +**Alternatives considered**: +- Periodic polling (SyncPeriod): wastes resources when nothing changed +- Generation-based: doesn't capture all relevant changes +- Always fetch: simpler but wasteful at scale + +## R3: Reusable Code from AgentCard Controller and PR #284 + +**Decision**: Reuse the following components directly: + +| Component | Package | Reuse strategy | +|-----------|---------|----------------| +| `agentcard.Fetcher` interface | `internal/agentcard` | Direct reuse, inject into AgentRuntime reconciler | +| `agentcard.AuthenticatedFetcher` interface | `internal/agentcard` | Direct reuse for mTLS path | +| `agentcard.SpiffeFetcher` | `internal/agentcard` | Direct reuse, same X509Source | +| `agentcard.ConfigMapFetcher` | `internal/agentcard` | Direct reuse for signed card ConfigMap path | +| `AgentCardData` struct | `api/v1alpha1` | Embed in new `CardStatus` struct | +| `signature.Provider` interface | `internal/signature` | Direct reuse for JWS verification | +| `signature.VerificationResult` | `internal/signature` | Direct reuse for verification fields | +| `doHTTPFetch()` | `internal/agentcard` | Already shared between fetchers | +| `extractSpiffeIDFromTLS()` | `internal/agentcard` | Already used by SpiffeFetcher | + +**Rationale**: The entire fetch, parse, and verify pipeline already exists. The AgentRuntime controller just needs to call the same interfaces and write results to a different status struct. 
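As a rough illustration of "call the same interfaces and write results to a different status struct", the sketch below injects a fetcher into a reconciler and maps the outcome onto a status value and a `CardSynced` reason. All types here are simplified stand-ins: `cardFetcher` is a hypothetical interface (the real `agentcard.Fetcher` method set is not shown in this document and may differ), and `cardData` / `cardStatus` carry only a fraction of the real fields.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// cardData and cardStatus are simplified stand-ins for AgentCardData and
// CardStatus; the real structs carry many more fields.
type cardData struct {
	Name    string
	Version string
}

type cardStatus struct {
	cardData // card payload, embedded as in the data model
}

// cardFetcher is a hypothetical stand-in for the agentcard.Fetcher interface.
type cardFetcher interface {
	Fetch(ctx context.Context, url string) (*cardData, error)
}

type reconciler struct {
	fetcher cardFetcher // injected at startup; nil when discovery is disabled
}

// syncCard returns the status to store and the CardSynced reason. On a fetch
// failure it returns the previous status so the last good card is retained.
func (r *reconciler) syncCard(ctx context.Context, url string, prev *cardStatus) (*cardStatus, string) {
	if r.fetcher == nil {
		return nil, "CardDiscoveryDisabled" // flag off: clear any existing card
	}
	card, err := r.fetcher.Fetch(ctx, url)
	if err != nil || card == nil {
		return prev, "CardFetchFailed"
	}
	return &cardStatus{cardData: *card}, "CardSynced"
}

// stubFetcher returns a canned card or error, similar in spirit to the
// stubCardFetcher used in the controller test.
type stubFetcher struct {
	card *cardData
	err  error
}

func (s stubFetcher) Fetch(context.Context, string) (*cardData, error) { return s.card, s.err }

var errUnreachable = errors.New("endpoint unreachable")

// runSync wires a fetcher into a reconciler and runs one sync against a
// placeholder URL.
func runSync(f cardFetcher, prev *cardStatus) (*cardStatus, string) {
	r := &reconciler{fetcher: f}
	return r.syncCard(context.Background(), "http://my-agent.default.svc.cluster.local:8000", prev)
}

func main() {
	status, reason := runSync(stubFetcher{card: &cardData{Name: "Test Agent", Version: "2.0"}}, nil)
	fmt.Println(status.Name, reason)

	status, reason = runSync(stubFetcher{err: errUnreachable}, status)
	fmt.Println(status.Name, reason) // previous card retained on failure
}
```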
+ +## R4: Feature Flag Design + +**Decision**: Add `--enable-card-discovery` flag to the operator binary (cmd/main.go). When disabled, the card fetch phase in the reconcile loop is a no-op. When toggled off, the reconciler clears `status.card` on the next reconcile. + +**Rationale**: Follows the existing pattern of `--enable-verified-fetch` flag on the AgentCard controller. Simple boolean flag, no runtime reconfiguration needed. + +**Implementation note**: The flag controls whether the `Fetcher` and `AuthenticatedFetcher` are injected into the AgentRuntime reconciler at startup. When disabled, the fields are nil and the card fetch phase short-circuits. + +## R5: CardStatus Struct Design + +**Decision**: New `CardStatus` struct wrapping `AgentCardData` with fetch metadata and verification fields. Placed in `api/v1alpha1/agentruntime_types.go`. + +**Rationale**: Keeps card payload separate from fetch/verification metadata. The `AgentCardData` struct is reused as-is (no duplication). Fetch metadata and verification fields are added alongside, not mixed into the card payload. + +**Fields**: +- Card payload: embedded `AgentCardData` (name, description, version, url, skills, capabilities, etc.) +- Fetch metadata: `fetchedAt` (timestamp), `cardId` (SHA-256 content hash), `protocol` (detected agent protocol) +- Verification: `validSignature` (bool), `signatureKeyID`, `attestedAgentSpiffeID`, `signatureVerificationDetails` + +**Note (from review feedback)**: `lastPodTemplateHash` was originally in this struct but moved to an annotation (`agent.kagenti.dev/last-card-fetch-hash`) to avoid coupling the change-detection mechanism to the public API surface. + +## R6: Deprecation Warning Implementation + +**Decision**: Add a log warning at Info level in `AgentCardReconciler.Reconcile()` when processing a newly created AgentCard (check if `agentCard.CreationTimestamp` is within the last reconcile window). Also emit a Kubernetes Event. + +**Rationale**: Non-intrusive. 
Operators see it in logs and events without any behavior change. The AgentCard controller continues to function normally. + +**Implementation note**: Check `agentCard.CreationTimestamp.After(time.Now().Add(-5 * time.Minute))` as a heuristic for "recently created". Log once per new card, not on every reconcile. diff --git a/specs/001-agentcard-into-status/spec.md b/specs/001-agentcard-into-status/spec.md new file mode 100644 index 00000000..67f7a4a2 --- /dev/null +++ b/specs/001-agentcard-into-status/spec.md @@ -0,0 +1,163 @@ +# Feature Specification: Consolidate AgentCard Data Into AgentRuntime Status + +**Feature Branch**: `001-agentcard-into-status` +**Created**: 2026-05-21 +**Status**: Draft +**Input**: Brainstorm document `brainstorm/01-agentcard-into-agentruntime.md` + +## Clarifications + +### Session 2026-05-21 + +- Q: How should the controller discover the Service endpoint for a given AgentRuntime's targetRef workload? → A: Selector match: resolve the workload's Pod selector, find Services whose selector matches, use the first match. +- Q: What happens to previously populated status.card data when a card fetch fails? → A: Retain last successful card data; rely on the CardSynced condition and fetchedAt timestamp to signal staleness. +- Q: What happens to existing status.card data when the feature flag is toggled off? → A: Clear status.card on the next reconcile of each AgentRuntime when the flag is disabled. +- Q: What should status.card contain, given mTLS is in scope? → A: Card payload fields, fetch metadata (fetchedAt, cardId, protocol), and verification fields (signature validation, SPIFFE identity). mTLS reuses infrastructure from PR #284. + +## User Scenarios & Testing *(mandatory)* + +### User Story 1 - Discover Agent Capabilities from AgentRuntime (Priority: P1) + +A platform operator queries a single resource (AgentRuntime) to see what an agent can do, including its name, description, skills, supported protocols, and endpoint URL. 
Today this requires cross-referencing AgentRuntime and AgentCard, which doubles the lookup effort and makes scripting harder. + +**Why this priority**: This is the core value of the consolidation. Without card data surfaced on AgentRuntime, the entire feature has no observable benefit. + +**Independent Test**: Deploy an agent workload with a valid `/.well-known/agent-card.json` endpoint, create an AgentRuntime targeting it, and confirm `status.card` is populated with the card data via `kubectl get agentruntime -o yaml`. + +**Acceptance Scenarios**: + +1. **Given** an AgentRuntime targeting a workload (Deployment, StatefulSet, or Sandbox) whose Pods serve a valid A2A agent card at `/.well-known/agent-card.json`, **When** the controller reconciles the AgentRuntime, **Then** `status.card` contains the agent's name, description, skills, capabilities, endpoint URL, fetchedAt timestamp, cardId content hash, and detected protocol. +2. **Given** an AgentRuntime whose `status.card` is already populated, **When** the backing workload rolls out a new Pod template (hash change or generation change), **Then** the controller re-fetches the card and updates `status.card` with the new content. +3. **Given** an AgentRuntime targeting a workload, **When** the agent's card endpoint is unreachable or returns invalid JSON, **Then** `status.card` retains the last successfully fetched data and a `CardSynced` condition indicates the fetch failure with a human-readable reason. + +--- + +### User Story 2 - Verified Card Discovery via mTLS (Priority: P1) + +A platform operator deploying agents with SPIFFE identity (via SPIRE) can see the card's signature verification status and the attested SPIFFE ID directly on the AgentRuntime, confirming the card was fetched over a verified mTLS connection and its JWS signature is valid. + +**Why this priority**: Identity verification is a core platform security requirement. 
Reusing the mTLS infrastructure from PR #284 makes this achievable without significant new code. + +**Independent Test**: Deploy an agent with SPIRE identity configured, create an AgentRuntime with mTLS mode enabled, and verify `status.card` includes signature validation result and attested SPIFFE ID. + +**Acceptance Scenarios**: + +1. **Given** an AgentRuntime with mTLS enabled and a backing agent that presents a valid SPIFFE certificate, **When** the controller fetches the card over mTLS, **Then** `status.card` includes the attested SPIFFE ID extracted from the peer certificate and the signature validation result. +2. **Given** an AgentRuntime with mTLS enabled and a backing agent whose SPIFFE certificate does not match the expected trust domain, **When** the controller attempts to fetch the card, **Then** the fetch fails, `status.card` retains stale data, and the `CardSynced` condition reports an identity verification failure. +3. **Given** an AgentRuntime without mTLS configured (or mTLS disabled), **When** the controller fetches the card, **Then** the fetch uses plain HTTP and verification fields in `status.card` remain empty. + +--- + +### User Story 3 - Deprecation Warning on AgentCard Creation (Priority: P2) + +A platform operator who still creates AgentCard CRs receives a clear deprecation warning so they know to migrate to the AgentRuntime-based discovery path. + +**Why this priority**: Backward compatibility is essential during the transition period. Operators need a signal to migrate without breaking existing workflows. + +**Independent Test**: Create an AgentCard CR and check controller logs for the deprecation warning message. + +**Acceptance Scenarios**: + +1. **Given** the operator is running with card discovery enabled, **When** a new AgentCard CR is created, **Then** the controller emits a deprecation log warning indicating that AgentCard is deprecated and card data should be consumed from AgentRuntime `status.card`. +2. 
**Given** an existing AgentCard CR, **When** the controller reconciles it, **Then** the AgentCard continues to function normally (both CRDs coexist). + +--- + +### User Story 4 - Feature-Gated Card Discovery (Priority: P2) + +A cluster administrator controls whether the new card discovery behavior is active via a feature flag, allowing gradual rollout without code changes. + +**Why this priority**: Operators need to opt in during the transition period. Disabling by default prevents surprises for existing installations. + +**Independent Test**: Start the operator with the feature flag disabled and verify no card fetch occurs; enable the flag and verify card fetching activates. + +**Acceptance Scenarios**: + +1. **Given** the operator is started without enabling the card discovery feature flag, **When** an AgentRuntime is reconciled, **Then** no card fetch is attempted and `status.card` remains empty. +2. **Given** the operator is started with the card discovery feature flag enabled, **When** an AgentRuntime is reconciled, **Then** the controller fetches `/.well-known/agent-card.json` from the agent's Service endpoint and populates `status.card`. +3. **Given** an AgentRuntime with populated `status.card` data, **When** the operator restarts with the feature flag disabled, **Then** `status.card` is cleared on the next reconcile of each AgentRuntime. + +--- + +### Edge Cases + +- What happens when the agent's Service has multiple ports? The controller uses selector matching to find the Service, then targets the port serving the A2A protocol (by well-known port name or the first HTTP port). +- How does the system handle a card endpoint that returns a valid JSON response but not a valid A2A agent card structure? The controller treats it as a fetch failure, retains any previously fetched data, and surfaces the parsing error in the `CardSynced` condition. +- What happens when the backing workload has zero ready Pods? 
The controller skips the card fetch and sets a condition indicating the workload is not ready. +- What happens if the card response is excessively large? The controller enforces a size limit on the response body (1 MiB) to prevent resource exhaustion. +- What happens when no Service matches the workload's Pod selector? The controller sets the `CardSynced` condition to false with reason "ServiceNotFound" and skips the fetch. +- What happens when the mTLS handshake fails (e.g., certificate expired, wrong trust domain)? The controller retains stale card data and reports the TLS error in the `CardSynced` condition. +- What happens if the card fetch hangs or is slow? The controller enforces a 10-second timeout on the HTTP/mTLS request. Timeout is treated as a fetch failure (stale data retained, `CardSynced=False`). +- What happens when an agent updates its card content without a workload rollout (e.g., hot-reloading skills)? The controller does not detect this. Card data in `status.card` reflects the state at the last rollout. This is an intentional constraint: event-driven fetch (no polling) trades freshness of dynamic card changes for reduced API server and network load. + +## Requirements *(mandatory)* + +### Functional Requirements + +- **FR-001**: The system MUST add a `card` field to AgentRuntime status that holds: A2A agent card payload (name, description, version, URL, skills, capabilities, provider, input/output modes), fetch metadata (fetchedAt timestamp, cardId content hash, detected protocol), and verification fields (signature validation result, attested SPIFFE ID). +- **FR-002**: The system MUST fetch the agent card from `/.well-known/agent-card.json` on the agent's Service endpoint when reconciling an AgentRuntime. +- **FR-003**: The system MUST trigger card re-fetch only on Pod template hash changes (rollout events), not on a periodic timer. 
+- **FR-004**: The system MUST record a `CardSynced` condition on AgentRuntime status indicating the result of the last fetch attempt (success, failure with reason, or skipped). +- **FR-005**: The system MUST gate the card fetch behavior behind a feature flag that defaults to disabled. +- **FR-006**: The system MUST emit a deprecation log warning when a new AgentCard CR is created. +- **FR-007**: The system MUST record a `fetchedAt` timestamp in `status.card` so operators can see when the card data was last refreshed. +- **FR-008**: The system MUST resolve the agent's Service endpoint by matching the workload's Pod selector labels to Services in the same namespace (selector match), using the first matching Service. This applies to all supported workload types (Deployment, StatefulSet, Sandbox). +- **FR-009**: The system MUST enforce a maximum response body size when fetching the card to prevent resource exhaustion. +- **FR-010**: The system MUST support mTLS for the card fetch, reusing the SPIFFE/SPIRE infrastructure from PR #284. When mTLS is configured on the AgentRuntime, the controller uses the workload's SVID to establish a verified connection. +- **FR-011**: The system MUST populate verification fields in `status.card` (attested SPIFFE ID, signature validation result) when the card is fetched over mTLS and the card contains JWS signatures. +- **FR-012**: The system MUST clear `status.card` when the feature flag is disabled, on the next reconcile of each AgentRuntime. +- **FR-013**: The system MUST retain the last successfully fetched card data when a fetch attempt fails, relying on the `CardSynced` condition and `fetchedAt` timestamp to signal staleness. + +### Key Entities + +- **AgentRuntime**: Existing CRD that attaches runtime configuration to a workload. Extended with `status.card` to hold discovered agent card data, fetch metadata, and verification results. +- **AgentCard (existing, deprecated)**: Existing CRD that caches A2A card data. 
Continues to function but emits deprecation warnings. Will be removed in a future iteration. +- **A2A Agent Card**: The JSON document served by agents at `/.well-known/agent-card.json` per the A2A protocol specification. Source of truth for card data. + +## Success Criteria *(mandatory)* + +### Measurable Outcomes + +- **SC-001**: Operators can retrieve agent capabilities (name, skills, endpoint) and identity verification status from a single `kubectl get agentruntime` command instead of cross-referencing two resources. +- **SC-002**: Card data in `status.card` reflects the running agent's actual capabilities within one reconciliation cycle after a rollout. +- **SC-003**: Existing AgentCard-based workflows continue to function without modification during the transition period. +- **SC-004**: The feature can be enabled or disabled at operator startup without redeployment of agent workloads. Disabling clears stale card data from all AgentRuntimes. +- **SC-005**: When mTLS is configured, the card fetch verifies the agent's SPIFFE identity and validates JWS signatures, with results visible in `status.card`. + +## Out of Scope (with migration path) + +The following are explicitly out of scope for this iteration. Each item includes the intended migration path so the deprecation trajectory is visible. + +### Identity binding policy (`spec.identityBinding`) + +Identity binding is a **workload identity policy** that validates whether an agent's SPIFFE ID belongs to a configured trust domain. It is orthogonal to card discovery: the card fetch is merely the mechanism that surfaces the SPIFFE ID, but the binding evaluation and enforcement are about the workload, not the card content. + +**Current home**: `AgentCard.spec.identityBinding` (trustDomain, strict) +**Intended destination**: `AgentRuntime.spec` in a follow-up iteration, alongside the existing `spec.identity.spiffe.trustDomain` field. +**During coexistence**: Identity binding policy stays on AgentCard. 
The AgentCard controller continues to evaluate binding and propagate the `signature-verified` label. No enforcement behavior changes. +**Brainstorm**: See `brainstorm/02-identity-binding-migration.md` for detailed analysis. + +### Enforcement actions (label propagation, NetworkPolicy) + +The current AgentCard controller propagates the `signature-verified` label to workloads based on identity verification results. The NetworkPolicy controller uses this label to gate inter-agent traffic. + +**This PR**: `status.card` on AgentRuntime is purely observational. It surfaces verification results but does not drive enforcement actions. +**During coexistence**: The AgentCard controller continues to handle all enforcement (label propagation, NetworkPolicy). No enforcement behavior changes. +**Future iteration**: When identity binding moves to AgentRuntime.spec, the enforcement logic (label propagation) moves to the AgentRuntime controller. This is a separate spec. + +### AgentCardSyncReconciler + +The `AgentCardSyncReconciler` auto-creates AgentCard CRs for labelled agent workloads. It continues to function during coexistence. Its deprecation and removal will be part of the AgentCard CRD removal iteration (after identity binding migration is complete). + +### AgentCard CRD removal and migration tooling + +Full CRD removal, ValidatingAdmissionPolicy for label restriction, and migration tooling are deferred until identity binding and enforcement have migrated to AgentRuntime. + +## Assumptions + +- Each AgentRuntime targets exactly one workload (Deployment, StatefulSet, or Sandbox), and there is at most one Service matching that workload's Pod selector in the same namespace. +- The card fetch adds negligible latency to the reconcile loop (single HTTP/mTLS GET, typically sub-second). +- The existing `AgentCardData` Go struct (already defined in the codebase) can be extended or wrapped to include fetch metadata and verification fields for `status.card`. 
+- IBM maintainers have agreed to the AgentCard deprecation path (confirmed 2026-05-15 per brainstorm context). +- The mTLS verified fetch infrastructure from PR #284 (merged 2026-05-20) is stable and reusable for the AgentRuntime controller's card fetch. +- Agents with SPIRE identity configured already have SVIDs available via spiffe-helper in the operator Pod or via the SPIRE agent socket. diff --git a/specs/001-agentcard-into-status/tasks.md b/specs/001-agentcard-into-status/tasks.md new file mode 100644 index 00000000..4b1bc952 --- /dev/null +++ b/specs/001-agentcard-into-status/tasks.md @@ -0,0 +1,202 @@ +# Tasks: Consolidate AgentCard Data Into AgentRuntime Status + +**Input**: Design documents from `specs/001-agentcard-into-status/` +**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/ + +**Tests**: Test tasks are included since this is a controller change requiring unit and integration test coverage. + +**Organization**: Tasks are grouped by user story to enable independent implementation and testing of each story. + +## Format: `[ID] [P?] [Story] Description` + +- **[P]**: Can run in parallel (different files, no dependencies) +- **[Story]**: Which user story this task belongs to (e.g., US1, US2, US3) +- Include exact file paths in descriptions + +## Phase 1: Setup + +**Purpose**: CRD type changes and code generation that all stories depend on + +- [X] T001 Add `CardStatus` struct to `AgentRuntimeStatus` in `kagenti-operator/api/v1alpha1/agentruntime_types.go`. The `CardStatus` struct embeds `AgentCardData` (reuse existing struct) and adds: `FetchedAt *metav1.Time`, `CardId string`, `Protocol string`, `ValidSignature *bool`, `SignatureKeyID string`, `SignatureVerificationDetails string`, `AttestedAgentSpiffeID string`. The change-detection hash is stored as an annotation (`agent.kagenti.dev/last-card-fetch-hash`), not in the struct. Add the `Card *CardStatus` field to `AgentRuntimeStatus`. 
Add `CardSynced` as a new condition type constant. +- [X] T002 Run `make generate` and `make manifests` in `kagenti-operator/` to regenerate deepcopy functions and CRD manifests. Verify `zz_generated.deepcopy.go` has the new `CardStatus` deepcopy method and `config/crd/bases/` has the updated AgentRuntime CRD. +- [X] T003 Add `--enable-card-discovery` boolean flag (default: false) to `kagenti-operator/cmd/main.go`. When enabled, create `agentcard.NewConfigMapFetcher()` and `agentcard.NewSpiffeFetcher()` (conditional on SPIRE config), and inject them into the `AgentRuntimeReconciler` as new fields: `AgentFetcher agentcard.Fetcher`, `AuthenticatedFetcher agentcard.AuthenticatedFetcher`, `SignatureProvider signature.Provider`, `EnableCardDiscovery bool`, `SpireTrustDomain string`. Follow the existing pattern of `--enable-verified-fetch` flag wiring for the AgentCard controller. + +**Checkpoint**: CRD types updated, code generated, flag wired. Ready for controller changes. + +--- + +## Phase 2: Foundational (Blocking Prerequisites) + +**Purpose**: Service resolution helper and RBAC that all card discovery stories need + +**CRITICAL**: No user story work can begin until this phase is complete + +- [X] T004 Add a `resolveServiceForWorkload` method to `AgentRuntimeReconciler` in `kagenti-operator/internal/controller/agentruntime_controller.go`. Given a namespace and Deployment name, first try to get a Service with the same name (matching existing AgentCard convention). If not found, list Services in the namespace, match by Pod selector labels from the Deployment, and return the first match. Return the Service object and the selected port (first HTTP port, or default 8000). Also add `getAgentTLSPort` (reuse logic from `agentcard_controller.go` line 684). 
+- [X] T005 [P] Add Service `get;list;watch` RBAC for the agentruntime controller in `kagenti-operator/internal/controller/agentruntime_controller.go` via kubebuilder RBAC markers: `// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch`. Run `make manifests` to update `config/rbac/`. + +**Checkpoint**: Service resolution and RBAC ready. User story implementation can begin. + +--- + +## Phase 3: User Story 1 - Discover Agent Capabilities from AgentRuntime (Priority: P1) MVP + +**Goal**: AgentRuntime `status.card` is populated with A2A card data fetched from the agent's Service endpoint on rollout events. Operators can read card data from a single resource. + +**Independent Test**: Deploy an agent with a valid `/.well-known/agent-card.json`, create an AgentRuntime, and confirm `status.card` is populated via `kubectl get agentruntime -o yaml`. + +### Tests for User Story 1 + +- [X] T006 [P] [US1] Add unit tests for `resolveServiceForWorkload` in `kagenti-operator/internal/controller/agentruntime_controller_test.go`. Test cases: Service found by name, Service found by selector match, no matching Service, multiple ports (uses first HTTP port), Deployment with no ready pods. +- [X] T007 [P] [US1] Add unit tests for the card fetch phase in `kagenti-operator/internal/controller/agentruntime_controller_test.go`. Test cases: successful fetch populates `status.card` with all fields, fetch failure retains stale data and sets `CardSynced=False`, invalid JSON sets `CardParseFailed` condition, feature flag disabled skips fetch, pod template hash unchanged skips fetch, feature flag toggled off clears `status.card`. + +### Implementation for User Story 1 + +- [X] T008 [US1] Add a `fetchAndUpdateCard` method to `AgentRuntimeReconciler` in `kagenti-operator/internal/controller/agentruntime_controller.go`. This is the main card fetch phase, called from `Reconcile()` after step 5 (config hash). 
Logic: (1) If `EnableCardDiscovery` is false, clear `status.card` if populated, set `CardSynced` condition to `CardDiscoveryDisabled`, return. (2) Read the change-detection hash from annotation `agent.kagenti.dev/last-card-fetch-hash` on the AgentRuntime; compare against current workload pod template hash (or generation for StatefulSet/Sandbox); if unchanged and `status.card` is populated, set `CardSynced` to `FetchSkipped`, return. (3) Call `resolveServiceForWorkload`. (4) Build service URL via `agentcard.GetServiceURL`. (5) Call `AgentFetcher.Fetch()`. (6) On success: build `CardStatus` from fetched `AgentCardData`, set `fetchedAt`, compute `cardId` via SHA-256, set `protocol`, store current hash in annotation. (7) On failure: retain existing `status.card`, set `CardSynced=False` with error reason. (8) Update status via `r.Status().Update()` with retry. Note: FR-009 (max response body size) is implicitly satisfied because the reused `doHTTPFetch()` in `internal/agentcard/fetcher.go` already enforces `maxCardSize = 1 MiB`. +- [X] T009 [US1] Wire the `fetchAndUpdateCard` call into the `Reconcile()` method in `kagenti-operator/internal/controller/agentruntime_controller.go`. Insert after the config hash computation (step 5, around line 165) and before the label propagation phase. Pass the resolved target workload info. +- [X] T010 [US1] Extract the change-detection key from the target workload in the `fetchAndUpdateCard` method. For Deployments, read the pod-template-hash label. For StatefulSets and Sandboxes, use the resource generation. Store the key in the `agent.kagenti.dev/last-card-fetch-hash` annotation on the AgentRuntime object (not in status). + +**Checkpoint**: US1 complete. `status.card` is populated on reconcile when the flag is enabled. 
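Steps (1), (2), and (6) of T008 can be sketched in isolation as below. This is a minimal illustration, not the operator's code: `decideCardFetch`, `computeCardID`, and the `fetchDecision` type are hypothetical names, and hashing the raw response bytes with hex encoding is an assumption (the task only says `cardId` is computed via SHA-256). Only the annotation key and the reason strings `CardDiscoveryDisabled` and `FetchSkipped` come from the tasks above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// Annotation key named in T008 and T010.
const lastCardFetchHashAnnotation = "agent.kagenti.dev/last-card-fetch-hash"

// fetchDecision is a hypothetical helper type for this sketch.
type fetchDecision string

const (
	decisionDisabled fetchDecision = "CardDiscoveryDisabled" // T008 step 1
	decisionSkipped  fetchDecision = "FetchSkipped"          // T008 step 2
	decisionFetch    fetchDecision = "Fetch"                 // proceed to steps 3-8
)

// decideCardFetch mirrors steps (1) and (2) of T008: the feature flag wins,
// then an unchanged change-detection key with a populated status.card skips
// the fetch. Anything else triggers a fetch.
func decideCardFetch(enabled bool, annotations map[string]string, currentHash string, cardPopulated bool) fetchDecision {
	if !enabled {
		return decisionDisabled
	}
	// Reading from a nil map is safe in Go; ok is false when the key is absent.
	last, ok := annotations[lastCardFetchHashAnnotation]
	if ok && last == currentHash && cardPopulated {
		return decisionSkipped
	}
	return decisionFetch
}

// computeCardID returns the hex-encoded SHA-256 of the card bytes.
// Assumption: the hash covers the raw response body, not a canonical form.
func computeCardID(rawCard []byte) string {
	sum := sha256.Sum256(rawCard)
	return hex.EncodeToString(sum[:])
}
```

Keeping the decision a pure function of the flag, the annotations, and the current key would let the skip and clear paths listed in T007 and T017 be tested without an API server, though the persisted result should still be read back through envtest as the constitution requires.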
+ +--- + +## Phase 4: User Story 2 - Verified Card Discovery via mTLS (Priority: P1) + +**Goal**: When mTLS is configured, the card fetch uses the SPIFFE/SPIRE infrastructure from PR #284 to establish a verified connection. Verification results (attested SPIFFE ID, signature validation) are surfaced in `status.card`. + +**Independent Test**: Deploy an agent with SPIRE identity, enable card discovery and mTLS on the AgentRuntime, and verify `status.card` includes `attestedAgentSpiffeID` and `validSignature` fields. + +### Tests for User Story 2 + +- [X] T011 [P] [US2] Add unit tests for mTLS card fetch in `kagenti-operator/internal/controller/agentruntime_controller_test.go`. Test cases: mTLS fetch populates `attestedAgentSpiffeID` and `validSignature`, mTLS handshake failure retains stale data and sets condition, fallback to HTTP when no TLS port, JWS signature verification populates `signatureKeyID`. + +### Implementation for User Story 2 + +- [X] T012 [US2] Extend `fetchAndUpdateCard` in `kagenti-operator/internal/controller/agentruntime_controller.go` to support mTLS. After resolving the Service, check for the `agent-tls` named port (reuse `getAgentTLSPort`). If present and `AuthenticatedFetcher` is not nil, call `AuthenticatedFetcher.FetchAuthenticated()` instead of `AgentFetcher.Fetch()`. On success, populate `attestedAgentSpiffeID` from `FetchResult.AgentSpiffeID`. +- [X] T013 [US2] Add JWS signature verification to `fetchAndUpdateCard` in `kagenti-operator/internal/controller/agentruntime_controller.go`. After fetching the card data (mTLS or HTTP), if `SignatureProvider` is not nil and the card has signatures, call `SignatureProvider.VerifySignature()`. Populate `status.card.validSignature`, `status.card.signatureKeyID`, and `status.card.signatureVerificationDetails` from the `VerificationResult`. +- [X] T014 [US2] Handle mTLS fallback to HTTP in `fetchAndUpdateCard`. 
If `AuthenticatedFetcher` is set but no `agent-tls` port exists on the Service, fall back to `AgentFetcher.Fetch()` and leave verification fields empty. Log a warning and emit a Kubernetes Event (reuse pattern from agentcard_controller.go line 351-356). + +**Checkpoint**: US2 complete. mTLS verified card fetch works with SPIFFE identity extraction and JWS signature validation. + +--- + +## Phase 5: User Story 3 - Deprecation Warning on AgentCard Creation (Priority: P2) + +**Goal**: When a new AgentCard CR is created, the controller emits a deprecation log warning. Existing AgentCards continue to function normally. + +**Independent Test**: Create an AgentCard CR and check controller logs for the deprecation message. + +### Tests for User Story 3 + +- [X] T015 [P] [US3] Add unit test for deprecation warning in `kagenti-operator/internal/controller/agentcard_controller_test.go`. Verify that reconciling a recently created AgentCard emits a deprecation log message and Kubernetes Event. + +### Implementation for User Story 3 + +- [X] T016 [US3] Add deprecation warning to `AgentCardReconciler.Reconcile()` in `kagenti-operator/internal/controller/agentcard_controller.go`. After the deletion/finalizer check (around line 166), check if `agentCard.CreationTimestamp` is within the last 5 minutes. If so, log a warning: "AgentCard is deprecated; card data is now available via AgentRuntime status.card. Migrate to AgentRuntime-based discovery." Also emit a Kubernetes Event with reason "Deprecated" and type Warning. + +**Checkpoint**: US3 complete. Deprecation warnings emitted for new AgentCard CRs. + +--- + +## Phase 6: User Story 4 - Feature-Gated Card Discovery (Priority: P2) + +**Goal**: The card discovery behavior is fully controlled by the `--enable-card-discovery` flag. Disabling the flag clears stale card data. + +**Independent Test**: Start the operator with the flag disabled, verify no card fetch. Enable the flag, verify card fetch works. 
Disable again, verify `status.card` is cleared. + +### Tests and Verification for User Story 4 + +- [X] T017 [US4] Add unit tests for feature flag toggle behavior in `kagenti-operator/internal/controller/agentruntime_controller_test.go`. Test cases: (1) flag disabled means no card fetch attempted and `status.card` remains nil, (2) flag disabled clears existing populated `status.card` data and sets `CardSynced` condition to `CardDiscoveryDisabled`, (3) flag enabled triggers fetch and populates `status.card`, (4) toggling flag off after previous population clears data on next reconcile. These tests validate the flag-off cleanup logic built in T008 step 1. + +**Checkpoint**: US4 complete. Feature flag fully controls card discovery lifecycle. + +--- + +## Phase 7: Polish and Cross-Cutting Concerns + +**Purpose**: End-to-end validation, documentation, and CRD manifest finalization + +- [X] T018 [P] Run `make generate && make manifests` in `kagenti-operator/` to ensure all generated code and CRD manifests are up to date after all changes. +- [X] T019 [P] Add a `CardSynced` print column to the AgentRuntime CRD in `kagenti-operator/api/v1alpha1/agentruntime_types.go` via kubebuilder marker: `// +kubebuilder:printcolumn:name="CardSynced",type="string",JSONPath=".status.conditions[?(@.type=='CardSynced')].status",description="Card Sync Status"`. Regenerate manifests. +- [ ] T020 [P] Add e2e test scenario to `kagenti-operator/test/e2e/e2e_test.go` that deploys a test agent Deployment with a mock `/.well-known/agent-card.json` endpoint, creates an AgentRuntime targeting it with card discovery enabled, and verifies `status.card` is populated within 30 seconds. +- [ ] T021 Run full test suite: `make test` in `kagenti-operator/`. Fix any regressions. Ensure existing AgentCard controller tests still pass. 
+- [ ] T022 Verify CRD backward compatibility: confirm existing AgentRuntime CRs without `status.card` continue to work without errors when the operator runs with card discovery disabled. + +--- + +## Dependencies and Execution Order + +### Phase Dependencies + +- **Phase 1 (Setup)**: No dependencies, start immediately +- **Phase 2 (Foundational)**: Depends on Phase 1 (T001-T003 must complete) +- **Phase 3 (US1)**: Depends on Phase 2 (T004-T005 must complete) +- **Phase 4 (US2)**: Depends on Phase 3 (T008-T010 must complete, extends `fetchAndUpdateCard`) +- **Phase 5 (US3)**: Can start after Phase 1 (independent of US1/US2, different controller file) +- **Phase 6 (US4)**: Depends on Phase 3 (validates flag behavior of `fetchAndUpdateCard`) +- **Phase 7 (Polish)**: Depends on all user stories completing + +### User Story Dependencies + +- **US1 (P1)**: After Foundational. No dependencies on other stories. +- **US2 (P1)**: After US1. Extends `fetchAndUpdateCard` with mTLS path. +- **US3 (P2)**: After Setup only. Independent (modifies `agentcard_controller.go`, not `agentruntime_controller.go`). +- **US4 (P2)**: After US1. Validates flag toggle behavior already built in US1. 
+ +### Within Each User Story + +- Tests written first, then implementation +- Service resolution before card fetch +- Plain HTTP fetch before mTLS extension +- Commit after each task or logical group + +### Parallel Opportunities + +- T006 and T007 can run in parallel (different test scenarios, same file) +- T011 and T015 can run in parallel (different controller test files) +- T005 can run in parallel with T004 (RBAC markers vs service resolution logic) +- T019 and T020 can run in parallel (different files); T021 is not marked [P] and runs the full suite after them +- US3 (Phase 5) can run in parallel with US1 (Phase 3) since they modify different controllers + +--- + +## Parallel Example: User Story 1 + +```bash +# Launch US1 tests in parallel: +Task: "Unit tests for resolveServiceForWorkload in agentruntime_controller_test.go" +Task: "Unit tests for card fetch phase in agentruntime_controller_test.go" + +# US3 can run in parallel with US1 (different controller files): +Task: "Deprecation warning in agentcard_controller.go" +``` + +--- + +## Implementation Strategy + +### MVP First (User Story 1 Only) + +1. Complete Phase 1: Setup (CRD types + flag) +2. Complete Phase 2: Foundational (service resolution + RBAC) +3. Complete Phase 3: User Story 1 (plain HTTP card fetch) +4. **STOP and VALIDATE**: Test card discovery with a real agent deployment +5. Deploy/demo if ready + +### Incremental Delivery + +1. Setup + Foundational: CRD and infrastructure ready +2. Add US1 (plain HTTP card fetch): Test independently, deploy (MVP!) +3. Add US2 (mTLS verification): Test with SPIRE, deploy +4. Add US3 (deprecation warning): Independent, can ship alongside US1 +5. Add US4 (flag toggle validation): Confirms flag lifecycle +6.
Polish: e2e tests, print columns, backward compatibility check + +--- + +## Notes + +- [P] tasks = different files, no dependencies +- [Story] label maps task to specific user story for traceability +- Reuses existing `agentcard.Fetcher`, `agentcard.SpiffeFetcher`, `signature.Provider` interfaces (no new packages) +- All new code goes in existing files (no new Go source files) +- CRD change is additive (status field only), no API version bump needed