diff --git a/CLAUDE.md b/CLAUDE.md index 31dbfd4d..bd6207c7 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -45,3 +45,9 @@ Orchestrate skills for enhancing related repositories: | `skills:scan` | Audit repository skills | | `skills:write` | Create or edit skills following the standard | | `skills:validate` | Validate skill format and structure | + + +For additional context about technologies to be used, project structure, +shell commands, and other important information, read the current plan +at specs/003-mtls-transport-security/plan.md + diff --git a/specs/003-mtls-transport-security/REVIEWERS.md b/specs/003-mtls-transport-security/REVIEWERS.md new file mode 100644 index 00000000..300bdfeb --- /dev/null +++ b/specs/003-mtls-transport-security/REVIEWERS.md @@ -0,0 +1,85 @@ +# Review Guide: mTLS Transport Security for Agent Communication + +**Generated**: 2026-06-08 | **Spec**: [spec.md](spec.md) + +## Why This Change + +Agent-to-agent and controller-to-agent communication in kagenti currently runs over plaintext HTTP by default. While authbridge already has full mTLS support implemented (permissive/strict modes, SPIRE-based SVIDs, per-handshake cert rotation), the operator doesn't activate it by default. Operators must manually set flags and configure mTLS mode. This spec makes mTLS the default transport security, with clear error conditions when SPIRE is unavailable. + +## What Changes + +1. **mTLS enabled by default**: `mTLSMode` defaults to `permissive` (was implicitly `disabled`). Agents communicate over mTLS automatically when SPIRE is deployed. +2. **MTLSReady condition**: New status condition on AgentRuntime showing whether mTLS infrastructure (SPIRE) is available, with actionable error messages when it's not. +3. **Controller uses mTLS by default**: `--enable-verified-fetch` and `--enable-card-discovery` flags flip to `true`. SpiffeFetcher becomes the default card fetcher. +4. 
**JWS signing deprecation**: Legacy signing flags (`--require-a2a-signature`, `--signature-audit-mode`, `--enforce-network-policies`) emit deprecation warnings. +5. **Annotation-based mTLS delivery**: Controller sets `kagenti.io/mtls-mode` annotation on pod template (triggers restart on change). Webhook reads `mTLSMode` from AgentRuntime CR at pod CREATE and sets `MTLS_MODE` env var on authbridge container. + +No breaking changes. Existing deployments without SPIRE get a clear `MTLSReady=False` condition and can opt out with `mTLSMode: disabled`. + +## How It Works + +The implementation relies heavily on what is already built: + +- **Authbridge** (kagenti-extensions): mTLS is fully implemented across all proxy modes. No changes needed, only verification. +- **Operator**: The main work is (a) changing the `mTLSMode` kubebuilder default to `permissive`, (b) setting the `kagenti.io/mtls-mode` annotation on the pod template, (c) adding `MTLSReady` condition logic, (d) flipping flag defaults, and (e) having the webhook set the `MTLS_MODE` env var on the authbridge container. +- **SPIRE detection**: The controller checks whether spiffe-helper volume mounts exist in the workload's pod template. If they are absent while mTLS is enabled, the controller sets `MTLSReady=False/SPIREUnavailable`. +- **Rolling restart**: When `mTLSMode` changes, the `kagenti.io/mtls-mode` annotation on the pod template changes, triggering a Kubernetes rolling restart. This is independent of the platform config hash (per PR #405). + +Of 24 tasks, 2 are already done (`[DONE]`), 1 is partial (`[PARTIAL]`), and the remaining 21 are new work, mostly focused on the operator side. 
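
The SPIRE detection and `MTLSReady` logic described in "How It Works" could be sketched roughly as follows. This is a minimal illustration, not the operator's code: the `PodTemplate`, `Container`, and `VolumeMount` types are simplified stand-ins for the Kubernetes API types, and the function names and the volume name passed in are hypothetical. The reasons (`SPIREAvailable`, `SPIREUnavailable`, `MTLSDisabled`) are the ones listed in data-model.md.

```go
package main

// Simplified stand-ins for the Kubernetes pod template types, so the sketch
// compiles without Kubernetes dependencies. Only the needed fields are modeled.
type VolumeMount struct{ Name string }

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

type PodTemplate struct{ Containers []Container }

// hasVolumeMount reports whether any container in the template mounts a
// volume with the given name. The volume name is supplied by the caller
// because the real spiffe-helper volume name is not stated in this spec.
func hasVolumeMount(tpl PodTemplate, volumeName string) bool {
	for _, c := range tpl.Containers {
		for _, m := range c.VolumeMounts {
			if m.Name == volumeName {
				return true
			}
		}
	}
	return false
}

// mtlsReady maps (mTLSMode, SPIRE detected) to the MTLSReady condition
// status and reason.
func mtlsReady(mode string, spireDetected bool) (status bool, reason string) {
	switch {
	case mode == "disabled":
		return true, "MTLSDisabled"
	case spireDetected:
		return true, "SPIREAvailable"
	default:
		return false, "SPIREUnavailable"
	}
}
```

A reconciler would call `hasVolumeMount` on the workload's pod template and feed the result into `mtlsReady`. As the review guide notes, a mount-based heuristic like this may miss deployments that deliver SVIDs another way, such as a CSI driver.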
+ +## When It Applies + +**Applies when**: +- Deploying agents in clusters with SPIRE +- Controller fetching agent cards from live workloads +- Setting or changing `mTLSMode` on AgentRuntime CRs +- Migrating from JWS signing to mTLS-based identity + +**Does not apply when**: +- Using Istio service mesh for mTLS (explicitly out of scope, separate effort in [#399](https://github.com/kagenti/kagenti-operator/issues/399)) +- Working with user-supplied certificates or cert-manager (future iteration) +- Authbridge plugin changes (orthogonal) +- Cross-cluster agent federation (future work) + +## Key Decisions + +1. **SPIRE only, no Istio dependency**: The spec explicitly scopes to SPIRE-based mTLS. Istio service mesh mTLS (L4, pod-to-pod) is a separate effort tracked in [#399](https://github.com/kagenti/kagenti-operator/issues/399) and [PR #383](https://github.com/kagenti/kagenti-operator/pull/383). These are complementary (SPIRE = application-layer identity, Istio = infrastructure-layer encryption), not competing. + +2. **Permissive as default, not strict**: Accepts both TLS and plaintext inbound. This allows gradual rollout without breaking existing agents that haven't enabled SPIRE yet. + +3. **Controller uses go-spiffe SDK directly**: SpiffeFetcher uses `go-spiffe/v2` X509Source in-process, not file-based certificates. This is different from the data-plane approach (spiffe-helper sidecar writing PEM files). + +4. **Feature-gated via existing CRD field**: No new CLI flag for mTLS enablement. The `mTLSMode` field on AgentRuntimeSpec is the control surface. `--enable-verified-fetch` is retained as a kill switch only. + +5. **JWS signing soft-deprecated**: Flags default to false (already on main), warnings added. No code removal yet, just signaling the migration path. 
+ +## Areas Needing Attention + +- **Overlap with Istio mTLS work**: [PR #383](https://github.com/kagenti/kagenti-operator/pull/383) (SharedTrust controller, already merged) and [Issue #399](https://github.com/kagenti/kagenti-operator/issues/399) (Istio auto-labeling) introduce Istio-based mTLS at the infrastructure layer. This spec operates at the application layer (SPIRE). Reviewers should verify these don't conflict at the configuration level (e.g., what happens when both SPIRE mTLS and Istio mTLS are active on the same workload). + +- **Annotation + env var contract**: T005 (controller sets `kagenti.io/mtls-mode` annotation, webhook sets `MTLS_MODE` env var) and T020 (verifying authbridge reads this env var) are the critical integration point. If authbridge doesn't read `MTLS_MODE`, mTLS mode changes have no effect. + +- **MTLSReady condition gating Ready**: T016 proposes that `MTLSReady=False` should affect the overall `Ready` condition. The exact behavior (block Ready entirely vs. add a warning) needs careful design, since it changes the controller's availability semantics. + +- **SPIRE detection heuristic**: Using spiffe-helper volume mount presence as a proxy for "SPIRE is available" may not cover all deployment patterns (e.g., SPIRE with CSI driver instead of init container). + +## Open Questions + +- How do SPIRE-based mTLS (this spec) and Istio-based mTLS ([PR #383](https://github.com/kagenti/kagenti-operator/pull/383), [Issue #399](https://github.com/kagenti/kagenti-operator/issues/399)) coexist? Is double encryption acceptable, or should one disable when the other is active? +- Should `MTLSReady=False` block the overall `Ready=True` condition, or just add a warning? +- Does the SPIRE detection heuristic (spiffe-helper volume check) cover SPIRE CSI driver deployments? 
+ +## Review Checklist + +- [ ] Key decisions are justified (especially SPIRE-only vs Istio interaction) +- [ ] Scope matches the stated boundaries (no Istio, no user certs, no cross-cluster) +- [ ] Constitution compliance verified (all 5 principles addressed in plan.md) +- [ ] Annotation + env var contract matches authbridge expectations +- [ ] No conflict with existing SharedTrust controller ([PR #383](https://github.com/kagenti/kagenti-operator/pull/383)) +- [ ] Task reconciliation against `main` is accurate ([DONE]/[PARTIAL] markers) +- [ ] Success criteria are achievable and testable +- [ ] Deprecation warnings are clear and actionable + +--- + + diff --git a/specs/003-mtls-transport-security/brainstorm.md b/specs/003-mtls-transport-security/brainstorm.md new file mode 100644 index 00000000..8c461ec0 --- /dev/null +++ b/specs/003-mtls-transport-security/brainstorm.md @@ -0,0 +1,203 @@ +# Brainstorm: mTLS Transport Security for Agent Communication + +**Date**: 2026-06-03 +**Branch**: `mtls-spec` +**Jira**: RHAIENG-4944 — Agent Discovery via mTLS +**Parent Epic**: RHAIENG-4931 — Minimal AgentRuntime CRD Rework +**Parent Feature**: RHAISTRAT-1599 — Productize & Downstream the Agent Operator +**ADR**: ODH-ADR-AgentOps-0002 — Agent Network Policy and mTLS Identity +**Target Release**: rhoai-3.5.EA2 (deadline June 15-19, 2026) + +## Scope Decisions + +### What this spec covers + +**mTLS transport security for two communication paths:** + +1. **Controller-to-agent** (control plane → data plane) — the operator controller fetching agent cards and communicating with agent workloads over mTLS +2. 
**Agent-to-agent** (data plane ↔ data plane) — inter-agent calls where both sides prove identity via mutual TLS certificates + +### What this spec does NOT cover + +- Card discovery / `status.card` population (covered by spec 001) +- `spec.policy` enforcement (NetworkPolicy, AuthorizationPolicy) — separate spec +- AgentCard CRD deprecation and removal — separate migration spec +- AgentMesh CRD design — future work +- Cross-cluster agent federation — future work +- Bearer token / OAuth2 authorization (handled by authbridge, orthogonal to mTLS) + +## Clarification Answers + +### Q1: What sidecar mode should mTLS target? + +**Answer: All sidecar modes.** mTLS must work across all authbridge sidecar modes: +- **Envoy sidecar** (full proxy with iptables interception) +- **Proxy sidecar** (lightweight HTTP_PROXY-based) +- **Lite mode** (minimal footprint) +- **Waypoint** (standalone deployment, not injected) + +The mTLS implementation must be sidecar-agnostic — the certificate presentation and verification happens at whatever proxy layer is present. + +### Q2: What certificate sources should be supported? + +**Answer: SPIRE only.** Single certificate provider for the initial implementation: + +1. **SPIRE (default and only)** — SPIRE-issued X.509 SVIDs via the Workload API. Already deployed for JWT SVIDs in Kagenti. The spiffe-helper sidecar or go-spiffe SDK provides certificates. + +Istio, user-supplied certificates, and cert-manager are explicitly out of scope. No Istio dependency — Istio support can be added in a future iteration if needed. + +### Q3: How should the controller obtain its SPIFFE identity? + +**Answer: go-spiffe SDK directly in the controller binary (option b).** The controller talks to the SPIRE Workload API directly. This is already implemented and working via `SpiffeFetcher` using `go-spiffe` X509Source. No authbridge sidecar on the controller pod. + +### Q4: How should mTLS be enforced? 
+ +**Answer: Enabled by default, disabled is opt-in.** mTLS is on by default when SPIRE is available. Operators explicitly opt out per-AgentRuntime with `mTLSMode: disabled` if needed. No global feature flag for enforcement — it's the default behavior. + +### Q6: How should certificates reach the data-plane sidecar? + +**Answer: spiffe-helper sidecar (file-based, option a).** A spiffe-helper container fetches SVIDs from SPIRE and writes them as PEM files (`svid.pem`, `svid_key.pem`, `bundle.pem`) to a shared volume. The proxy reads these files and reloads on change. This is already implemented and deployed. + +**Why not go-spiffe SDK directly in the proxy (option b)?** Option (b) eliminates the spiffe-helper container and keeps certs in memory, but it ties the proxy to SPIRE's Workload API. Option (a) keeps the proxy certificate-source-agnostic — it reads PEM files regardless of where they came from. This preserves the sidecar-agnostic constraint and works across all authbridge modes. Option (b) is the long-term direction per `kagenti-extensions#332` but is deferred. + +### Q7: What happens when mTLS is enabled but SPIRE is not deployed? + +**Answer: Fail clearly (option b).** If mTLS is the default and SPIRE isn't available, set the AgentRuntime to an error state with a clear condition. The operator must either deploy SPIRE or explicitly set `mTLSMode: disabled`. No silent fallback to plain HTTP — that would mask a real security gap. + +### Q8: Should this spec cover authbridge code changes in kagenti-extensions? + +**Answer: Yes — cover both repos.** This spec defines the work required in both `kagenti-operator` (controller, CRD, webhook) and `kagenti-extensions` (authbridge proxy TLS contexts). The spec captures what needs to change in authbridge to enable mTLS (DownstreamTlsContext, UpstreamTlsContext, certificate loading). Downstreaming logistics (how to bring authbridge code into the product build) are explicitly out of scope — that's a separate spike. 
+ +### Q5: Istio integration? + +**Answer: Not supported in this iteration.** Istio is explicitly out of scope for this spec. No dependency on Istio. The platform uses SPIRE as the sole mTLS provider. Note: Istio service mesh with L4 mTLS is being worked on separately (PR #383, Issue #399, RHAIENG-5467) — that work is complementary, not competing. See spec.md "Coexistence with Istio mTLS" section. + +## Source Material + +### ADR Key Points (ODH-ADR-AgentOps-0002) + +- **mTLS is mandatory** — all agent-to-agent and controller-to-agent communication must use mutual TLS. This replaces card signing: both sides prove identity at transport layer, on every connection. +- **Istio is not a hard dependency** — SPIRE with authbridge is the default. Istio can replace or complement when configured. Platform must work without Istio. +- **Two layers, two concerns** — mTLS handles transport-level identity (both sides verify via certificates). Authbridge handles application-level authorization via bearer tokens (OAuth2 flow). These are complementary, not replacements. +- **Why mTLS replaces card signing** — card signing proves a card was signed at deploy time, not that the agent serving it now is who it claims to be. mTLS proves the latter on every connection. Eliminates skeleton-card problem (#292). +- **Certificate source pluggable** — SPIRE (default, recommended), user-supplied, cert-manager. From authbridge's perspective, it needs a key and a certificate; where they come from is a deployment decision. +- **Event-driven, not polling** — card content is static for Pod lifetime. Verification triggers on workload rollouts. + +### Meeting Decisions (May 21, 2026 — Sync with Kevin) + +- **Kevin's PR is the foundation** — SpiffeFetcher, go-spiffe integration, mTLS client for controller-to-agent card fetching. Reuse library helpers. +- **Agent-to-agent mTLS lives on the sidecar** — data plane concern, not control plane. Implementation goes into authbridge, not the controller. 
+- **Ingress + egress** — agent needs mTLS on both slots: outgoing calls (to tools/other agents) and incoming calls (from other agents/controller). +- **Same SPIRE server for control + data plane** — team discussed security implications, no significant concerns identified. Identities are explicitly reflected in client/server certificates. +- **Customer Istio environments** — if customer uses Istio for mTLS, default to their Istio. Don't enforce a custom solution that conflicts with cluster-level security policies. +- **Envoy resource concern** — 100-200MB per pod is heavy for one-agent-per-pod model. Exploring lightweight proxy alternatives. Not blocking mTLS work, but inform design. +- **Agent Runtime Contract** — document requirements for agent behavior (respect HTTP_PROXY, propagate bearer tokens). Contract mounted in agent container. Not in scope for this spec but informs sidecar design. +- **VAP over webhooks** — team chose Validating Admission Policies for label validation (lighter weight than validating webhooks). + +### Jira Context (RHAIENG-4944) + +- **Summary**: Replace AgentCard polling loop with on-demand mTLS fetches +- **Assignee**: Ian Miller +- **Status**: In Progress +- **Description**: Card data comes from the live agent, not a cached CRD. Controller fetches `/.well-known/agent-card.json` over mTLS on workload rollout events, stores result in `status.card` on AgentRuntime. 
+ +## What Already Exists in the Codebase + +### API Types (`api/v1alpha1/agentruntime_types.go`) + +- **`MTLSMode` field** on `AgentRuntimeSpec`: enum with values `disabled`, `permissive`, `strict` + - Auto-enables SPIRE when non-disabled +- **`CardStatus` struct**: holds fetched A2A agent card with mTLS metadata + - `TransportSecurity` (mtls | http) + - `AttestedAgentSpiffeID` — SPIFFE ID from peer cert + - `ValidSignature`, `SignatureKeyID`, `SignatureVerificationDetails` +- **`SPIFFEIdentity` struct**: per-workload SPIFFE identity overrides with trust domain + +### Fetcher Layer (`internal/agentcard/fetcher.go`) + +- **`Fetcher` interface**: plain HTTP card fetch +- **`AuthenticatedFetcher` interface**: mTLS-authenticated fetch +- **`SpiffeFetcher`**: uses go-spiffe X509Source, verifies peer cert SPIFFE ID against trust domain +- **`ConfigMapFetcher`**: reads signed cards from ConfigMap, falls back to HTTP + +### Signature Verification (`internal/signature/x5c.go`) + +- **`X5CProvider`**: verifies JWS signatures via x5c certificate chains against trust bundle +- Trust bundle supports PEM and SPIFFE JSON formats +- 5-minute auto-refresh interval +- Change detection with hash comparison + +### Feature Flags (`cmd/main.go`) + +- `--enable-card-discovery` — activates card discovery into `status.card` +- `--enable-verified-fetch` — activates mTLS-authenticated fetch via SPIFFE +- `--spire-trust-domain` — required when verified fetch enabled +- `--spire-trust-bundle-configmap` — ConfigMap/NS/key for trust bundle +- `--verified-fetch-spiffe-socket` — SPIFFE Workload API socket path +- `--require-a2a-signature` — enforce JWS signature validation (to be deprecated) +- `--signature-audit-mode` — log failures without blocking (to be deprecated) +- `--enforce-network-policies` — create NetworkPolicies (to be deprecated) + +### Controllers + +- **AgentRuntimeReconciler**: has card fetch phase gated by `EnableCardDiscovery`, uses `SpiffeFetcher` when available +- 
**AgentCardReconciler**: has `EnableVerifiedFetch`, `AuthenticatedFetcher`, SVID expiry grace period, signature verification +- **AgentCardNetworkPolicyReconciler**: creates NetworkPolicies based on signature verification (to be replaced by policy enforcement from AgentRuntime) + +## Authbridge (kagenti-extensions) Current State + +The authbridge in `kagenti-extensions` already has significant mTLS infrastructure: + +### Existing mTLS Code + +- **`authlib/tls/server.go`** — `ServerConfig()` builds mTLS `*tls.Config` for reverse-proxy listener. Presents local SVID, requires client cert, verifies against SPIRE trust bundle. Hot-rotation via per-handshake callbacks (re-reads cert+key from disk). +- **`authlib/tls/client.go`** — `ClientConfig()` builds mTLS `*tls.Config` for forward-proxy dialer. Presents local SVID, verifies server cert against trust bundle. Same rotation-aware pattern. +- **`authlib/spiffe/source.go`** — `X509Source` interface: `Certificate()` returns local SVID, `TrustBundle()` returns trust bundle CertPool. Abstraction over spiffe-helper file-based certs. +- **`authlib/spiffe/provider.go`** — File-based X509Source implementation reading from spiffe-helper written PEM files. + +### Proxy Binaries + +Three proxy variants in `authbridge/cmd/`: +- **`authbridge-proxy`** — full proxy-sidecar (HTTP_PROXY based) +- **`authbridge-lite`** — lightweight variant +- **`authbridge-envoy`** — Envoy-based sidecar + +Each has its own `main.go`, `Dockerfile`, and `entrypoint.sh`. 
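
The per-handshake rotation described above for `authlib/tls/server.go` can be illustrated with the standard library alone. The sketch below is not the authlib implementation: `rotatingServerConfig` is a hypothetical name, and the file names (`svid.pem`, `svid_key.pem`, `bundle.pem`) are the ones the brainstorm lists for spiffe-helper output. It only performs X.509 chain verification against the bundle and omits any SPIFFE ID checks.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"
	"path/filepath"
)

// rotatingServerConfig returns a TLS config that re-reads the SVID and the
// trust bundle from certDir on every handshake, so rotated files are picked
// up without restarting the process.
func rotatingServerConfig(certDir string) *tls.Config {
	// loadCert re-reads the certificate and key from disk each time it runs.
	loadCert := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := tls.LoadX509KeyPair(
			filepath.Join(certDir, "svid.pem"),
			filepath.Join(certDir, "svid_key.pem"),
		)
		if err != nil {
			return nil, err
		}
		return &cert, nil
	}

	return &tls.Config{
		MinVersion:     tls.VersionTLS12,
		GetCertificate: loadCert,
		// GetConfigForClient is invoked once per incoming handshake; returning
		// a fresh config lets the client trust bundle rotate as well.
		GetConfigForClient: func(*tls.ClientHelloInfo) (*tls.Config, error) {
			pem, err := os.ReadFile(filepath.Join(certDir, "bundle.pem"))
			if err != nil {
				return nil, err
			}
			pool := x509.NewCertPool()
			if !pool.AppendCertsFromPEM(pem) {
				return nil, errors.New("no certificates found in bundle.pem")
			}
			return &tls.Config{
				MinVersion:     tls.VersionTLS12,
				GetCertificate: loadCert,
				ClientAuth:     tls.RequireAndVerifyClientCert,
				ClientCAs:      pool,
			}, nil
		},
	}
}
```

Reading from disk on each handshake trades a small amount of I/O for never serving a stale certificate, which seems to be the trade-off the "hot-rotation" note points at.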
+ +### Key Observation — mTLS is ALREADY IMPLEMENTED in authbridge + +**All three proxy modes (proxy, lite, envoy) already have full mTLS wiring on main.** This significantly changes the scope of work: + +- `authbridge-proxy/main.go` and `authbridge-lite/main.go` both have identical mTLS wiring: + - Read `cfg.MTLS` config block, require SPIFFE provider when set + - Build `reverseproxy.MTLSOptions` (inbound) and `forwardproxy.MTLSOptions` (outbound) + - Permissive mode: inbound TLS-sniffing (peek-and-route), outbound plaintext + - Strict mode: inbound rejects non-TLS, outbound TLS-or-fail + - Shared `X509Source` across both listeners for consistent SVID + trust bundle + - `authtls.Metrics` for TLS handshake observability +- `reverseproxy.Server.Listen()` uses a byte-peek TLS-sniffing listener when mTLS is enabled +- The `authlib/tls` package provides `ServerConfig()` (mTLS reverse-proxy) and `ClientConfig()` (mTLS forward-proxy) +- The `authlib/spiffe` package provides `X509Source` interface and `Provider` implementation +- Certificate rotation is handled per-handshake (re-reads from spiffe-helper files) + +**The remaining authbridge work is likely:** +- Ensuring the operator passes the right MTLS config to the sidecar ConfigMap +- Testing mTLS across all three modes in integration +- Potentially: Envoy-specific DownstreamTlsContext/UpstreamTlsContext (if not already done) + +**The main remaining work is on the operator side:** +- Setting `kagenti.io/mtls-mode` annotation on pod template + webhook sets `MTLS_MODE` env var on authbridge +- Making mTLS enabled by default +- Error conditions when SPIRE is unavailable +- Controller-to-agent mTLS (SpiffeFetcher integration) +- Deprecating/removing JWS signing pipeline + +## Key Design Constraints + +1. **One agent per pod** — pod identity = agent identity (SPIFFE ID) +2. **Sidecar-agnostic** — mTLS must work across all authbridge modes +3. 
**SPIRE as default, Istio when present** — detect Istio enrollment per namespace +4. **No breaking changes** — existing deployments without mTLS must continue working +5. **Feature-gated** — mTLS enforcement behind flags, opt-in +6. **Authbridge prerequisite** — authbridge needs TLS contexts (DownstreamTlsContext, UpstreamTlsContext) added to envoy config +7. **EA2 deadline** — June 15-19, 2026 diff --git a/specs/003-mtls-transport-security/contracts/authbridge-mtls-config.md b/specs/003-mtls-transport-security/contracts/authbridge-mtls-config.md new file mode 100644 index 00000000..4d197311 --- /dev/null +++ b/specs/003-mtls-transport-security/contracts/authbridge-mtls-config.md @@ -0,0 +1,53 @@ +# Contract: Authbridge mTLS Configuration + +> **SUPERSEDED**: The ConfigMap-based `mtls:` block injection approach has been replaced by an annotation + env var approach per PR #405 team review. See below for the current contract. + +## Owner + +kagenti-operator controller + webhook (producer) → authbridge sidecar (consumer) + +## Current Contract: Annotation + Env Var + +### Controller → Workload Pod Template + +The controller sets an annotation on the workload's pod template: + +```yaml +metadata: + annotations: + kagenti.io/mtls-mode: "permissive" # or "strict" or "disabled" +``` + +When `mTLSMode` changes on the AgentRuntime CR, this annotation changes, triggering a rolling restart via Kubernetes pod template change detection. + +### Webhook → Authbridge Sidecar + +At pod CREATE time, the webhook reads `mTLSMode` from the AgentRuntime CR and sets an environment variable on the authbridge sidecar container: + +```yaml +env: + - name: MTLS_MODE + value: "permissive" # or "strict" or "disabled" +``` + +Authbridge reads `MTLS_MODE` at startup to configure its TLS listeners. 
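
How authbridge might turn `MTLS_MODE` into listener behavior can be sketched as below. This is an illustration under assumptions, not authbridge's actual startup code: `listenerPlan` and `planForMode` are hypothetical names, and rejecting an empty or unknown value with an error is an assumed policy that the contract does not specify. The per-mode behavior follows the contract's behavior table.

```go
package main

import "fmt"

// listenerPlan describes how the proxy listeners behave for a given mode.
type listenerPlan struct {
	InboundTLS       bool // accept TLS on the inbound (reverse-proxy) listener
	InboundPlaintext bool // accept plaintext on the inbound listener
	OutboundTLS      bool // require TLS on outbound (forward-proxy) calls
}

// planForMode maps the raw MTLS_MODE value to listener behavior.
// Unknown or empty values are rejected (an assumption of this sketch).
func planForMode(raw string) (listenerPlan, error) {
	switch raw {
	case "disabled":
		return listenerPlan{InboundPlaintext: true}, nil
	case "permissive":
		// TLS-sniffing inbound accepts both; outbound stays plaintext.
		return listenerPlan{InboundTLS: true, InboundPlaintext: true}, nil
	case "strict":
		return listenerPlan{InboundTLS: true, OutboundTLS: true}, nil
	default:
		return listenerPlan{}, fmt.Errorf("unsupported MTLS_MODE %q", raw)
	}
}
```

At startup the caller would pass `os.Getenv("MTLS_MODE")` and fail fast on error, so a misconfigured pod does not silently fall back to plaintext.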
+ +## Behavior Contract + +| mTLSMode | Annotation value | Env var | Inbound behavior | Outbound behavior | +|----------|-----------------|---------|-----------------|------------------| +| `disabled` | `disabled` | `disabled` | Plaintext only | Plaintext only | +| `permissive` | `permissive` | `permissive` | TLS-sniff: accepts TLS and plaintext | Plaintext | +| `strict` | `strict` | `strict` | TLS required, rejects plaintext | TLS required | + +## Separation of Concerns + +| Component | Responsibility | +|-----------|---------------| +| **Controller** | Sets `kagenti.io/mtls-mode` annotation on pod template. Triggers rolling restart on change. | +| **Webhook** | Reads `mTLSMode` from AgentRuntime CR at pod CREATE. Sets `MTLS_MODE` env var on authbridge container. | +| **Authbridge** | Reads `MTLS_MODE` env var at startup. Configures TLS listeners accordingly. No Kubernetes API dependency. | + +## SPIFFE Config (existing, no change) + +The SPIFFE block is already handled by the webhook when spiffe-helper is injected. No changes needed. diff --git a/specs/003-mtls-transport-security/data-model.md b/specs/003-mtls-transport-security/data-model.md new file mode 100644 index 00000000..31fa6cc2 --- /dev/null +++ b/specs/003-mtls-transport-security/data-model.md @@ -0,0 +1,116 @@ +# Data Model: mTLS Transport Security for Agent Communication + +## Entity Changes + +### Modified: AgentRuntimeSpec + +| Field | Change | Before | After | +|-------|--------|--------|-------| +| `mTLSMode` | Default changed | `disabled` | `permissive` | + +No new fields added to spec. The existing `mTLSMode` field (enum: `disabled`, `permissive`, `strict`) is reused with a changed default. 
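
The default change could be expressed with kubebuilder markers along these lines. This is a trimmed sketch rather than the real `agentruntime_types.go`: the constant names and the `effectiveMode` helper are hypothetical, and the struct keeps only the one field discussed here.

```go
package main

// MTLSMode mirrors the enum described for AgentRuntimeSpec.
// +kubebuilder:validation:Enum=disabled;permissive;strict
type MTLSMode string

const (
	MTLSModeDisabled   MTLSMode = "disabled"
	MTLSModePermissive MTLSMode = "permissive"
	MTLSModeStrict     MTLSMode = "strict"
)

// AgentRuntimeSpec is trimmed to the single field relevant to this change.
type AgentRuntimeSpec struct {
	// The default marker moves from disabled to permissive.
	// +kubebuilder:default=permissive
	// +optional
	MTLSMode MTLSMode `json:"mTLSMode,omitempty"`
}

// effectiveMode mimics API-server defaulting for objects built in memory
// (for example in unit tests), where the CRD default has not been applied.
func effectiveMode(spec AgentRuntimeSpec) MTLSMode {
	if spec.MTLSMode == "" {
		return MTLSModePermissive
	}
	return spec.MTLSMode
}
```

Because the default lives in the CRD schema, it is applied by the API server on create; code paths that construct specs without going through the API server would need a fallback like `effectiveMode`.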
+ +### Modified: AgentRuntimeStatus (via conditions) + +New condition type added: + +| Condition | Description | +|-----------|-------------| +| `MTLSReady` | Whether mTLS infrastructure (SPIRE) is available for this workload | + +### Existing: CardStatus (AgentRuntime status.card) + +No changes to the struct. The following fields are already defined and will be populated by the mTLS-enabled card fetch: + +| Field | Type | Description | Populated When | +|-------|------|-------------|---------------| +| `transportSecurity` | TransportSecurity | `mtls` or `http` | Every card fetch | +| `attestedAgentSpiffeID` | string | SPIFFE ID from peer certificate | mTLS card fetch | +| `validSignature` | *bool | JWS signature validation result | Signature present (deprecated path) | + +### New Condition: MTLSReady + +Added to AgentRuntime `status.conditions[]`. + +| Reason | Status | When | +|--------|--------|------| +| `SPIREAvailable` | True | SPIRE is deployed and certificates are available to the workload | +| `SPIREUnavailable` | False | mTLSMode is permissive or strict but SPIRE infrastructure is not detected | +| `MTLSDisabled` | True | mTLSMode is explicitly set to `disabled` | + +### New: Pod Template Annotation + +The controller sets a `kagenti.io/mtls-mode` annotation on the workload's pod template. This serves two purposes: (1) triggers a rolling restart when the value changes, and (2) makes the mTLS mode visible on the pod. + +| Annotation | Values | Set By | +|------------|--------|--------| +| `kagenti.io/mtls-mode` | `permissive`, `strict`, `disabled` | Controller (on pod template) | + +### New: Authbridge Sidecar Env Var + +The webhook sets an `MTLS_MODE` environment variable on the authbridge sidecar container at pod CREATE time. 
+ +| Env Var | Values | Set By | Read By | +|---------|--------|--------|---------| +| `MTLS_MODE` | `permissive`, `strict`, `disabled` | Webhook (at pod CREATE) | Authbridge (at startup) | + +### Modified: Feature Flags (cmd/main.go) + +| Flag | Before Default | After Default | Notes | +|------|---------------|---------------|-------| +| `--enable-verified-fetch` | `false` | `true` | Kill switch retained for one release | +| `--require-a2a-signature` | `true` | `false` | Deprecated | +| `--signature-audit-mode` | `true` | `false` | Deprecated | +| `--enforce-network-policies` | `true` | `false` | Deprecated | + +## Relationships + +``` +AgentRuntime.spec.mTLSMode + │ + ├── Controller sets kagenti.io/mtls-mode annotation on pod template + │ │ + │ └── Annotation change triggers rolling restart + │ + ├── Webhook reads mTLSMode from AgentRuntime CR at pod CREATE + │ │ + │ └── Sets MTLS_MODE env var on authbridge sidecar container + │ │ + │ ├── Inbound (reverse proxy): mTLS termination + │ └── Outbound (forward proxy): mTLS origination + │ + ├── Operator sets MTLSReady condition on AgentRuntime.status + │ │ (informational; does NOT block Ready condition) + │ │ + │ ├── SPIREAvailable (True): SPIRE detected + │ ├── SPIREUnavailable (False): SPIRE not detected + Warning Event + │ └── MTLSDisabled (True): mTLSMode: disabled + │ + └── Controller uses SpiffeFetcher (when SPIRE available) + │ + └── Fetches A2A card from live agent over mTLS + │ + ├── status.card.transportSecurity = "mtls" + └── status.card.attestedAgentSpiffeID = +``` + +## State Transitions for mTLSMode + +``` +[Default: permissive] -- operator sets mTLSMode: strict --> [Strict] +[Default: permissive] -- operator sets mTLSMode: disabled --> [Disabled] +[Strict] -- operator removes mTLSMode --> [Default: permissive] +[Disabled] -- operator removes mTLSMode --> [Default: permissive] + +kagenti.io/mtls-mode annotation changes on every mTLSMode transition → rolling restart +``` + +## State Transitions for MTLSReady condition 
+ +``` +[Not Set] -- reconcile with mTLSMode non-disabled + SPIRE available --> [True/SPIREAvailable] +[Not Set] -- reconcile with mTLSMode non-disabled + SPIRE absent --> [False/SPIREUnavailable] +[Not Set] -- reconcile with mTLSMode: disabled --> [True/MTLSDisabled] +[False/SPIREUnavailable] -- SPIRE deployed --> [True/SPIREAvailable] +[True/SPIREAvailable] -- mTLSMode changed to disabled --> [True/MTLSDisabled] +``` diff --git a/specs/003-mtls-transport-security/pending-changes.md b/specs/003-mtls-transport-security/pending-changes.md new file mode 100644 index 00000000..f92d4942 --- /dev/null +++ b/specs/003-mtls-transport-security/pending-changes.md @@ -0,0 +1,13 @@ +# Pending Spec Changes (awaiting team alignment) + +**Date**: 2026-06-05 +**Blocked on**: Team review of PR #401 (spec) + PR #405 review comment (mTLS annotation approach) + +## Changes needed after alignment + +1. **FR-009**: Change from "rolling restart via ConfigMap hash change" to annotation-based restart (`kagenti.io/mtls-mode` annotation on pod template) +2. **Remove ConfigMap mtls: block injection**: Controller does NOT inject `mtls:` block into authbridge ConfigMap. The webhook reads `mTLSMode` from the AgentRuntime CR and configures authbridge at pod CREATE time. +3. **Update tasks T005 and contracts/authbridge-mtls-config.md** to reflect webhook-driven config delivery instead of controller-driven ConfigMap injection. +4. **Add**: Controller sets `kagenti.io/mtls-mode` annotation on the pod template. When `mTLSMode` changes on the CR, annotation changes → pod template changes → rolling restart. +5. **Add**: Webhook reads `mTLSMode` from AgentRuntime CR at pod CREATE time, sets `MTLS_MODE` env var on authbridge sidecar container. +6. **Clarify controller/webhook boundary**: Controller owns labels, annotations, status. Webhook owns sidecar injection and container configuration. 
diff --git a/specs/003-mtls-transport-security/plan.md b/specs/003-mtls-transport-security/plan.md new file mode 100644 index 00000000..657bd0d3 --- /dev/null +++ b/specs/003-mtls-transport-security/plan.md @@ -0,0 +1,104 @@ +# Implementation Plan: mTLS Transport Security for Agent Communication + +**Branch**: `003-mtls-transport-security` | **Date**: 2026-06-03 | **Spec**: [spec.md](spec.md) + +**Input**: Feature specification from `specs/003-mtls-transport-security/spec.md` + +## Summary + +Enable mTLS as the default transport security for controller-to-agent and agent-to-agent communication. Authbridge mTLS is already implemented in kagenti-extensions across all proxy modes (envoy, proxy-sidecar, lite). The remaining work is operator-side: setting a `kagenti.io/mtls-mode` annotation on the pod template from the controller (which triggers a rolling restart on change), having the webhook read `mTLSMode` from the AgentRuntime CR and set the `MTLS_MODE` env var on the authbridge container, defaulting `mTLSMode` to `permissive`, wiring `SpiffeFetcher` as the default card fetcher, adding the `MTLSReady` status condition, and deprecating the JWS signing pipeline flags. SPIRE is the sole certificate provider; Istio is out of scope. 
+ +## Technical Context + +**Language/Version**: Go 1.25, controller-runtime v0.23.3 + +**Primary Dependencies**: controller-runtime, go-spiffe/v2, k8s.io/apimachinery + +**Storage**: Kubernetes CRD status subresource (no external storage) + +**Testing**: Ginkgo/Gomega (unit + integration), envtest for controller tests, e2e in `test/e2e/` + +**Target Platform**: Kubernetes 1.31+ + +**Project Type**: Kubernetes operator (kubebuilder-based) + +**Performance Goals**: mTLS adds < 5ms to TLS handshake; no polling (event-driven only) + +**Constraints**: No Istio dependency; feature-gated where appropriate (Constitution V); SPIRE must be deployed for mTLS; backward compatible with existing deployments + +**Scale/Scope**: Hundreds of AgentRuntimes per cluster; one SPIRE-issued SVID per agent pod + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +| Principle | Status | Notes | +|-----------|--------|-------| +| I. Reconciler Status Integrity | PASS | `MTLSReady` condition and `status.card` mutations must use save/restore pattern around any Patch calls. Plan ensures status updates happen after metadata patches. | +| II. Spec-Anchored Testing | PASS | All tests will create objects in envtest and read back from API server to verify conditions and status.card fields. | +| III. Controller-Runtime Safety | PASS | ConfigMap hash annotation (metadata patch) happens before Status().Update() for MTLSReady and card fields. No blocking HTTP calls without timeout. | +| IV. CRD-First Design | PASS | `MTLSReady` condition type added as a constant. `mTLSMode` default change is in the CRD schema. All status fields use concrete types with JSON tags. | +| V. Feature-Gated Rollout | PASS | `--enable-verified-fetch` retained as kill switch (default: true). Signing flags default to false with deprecation warnings. mTLS default-on is via `mTLSMode` field default, not a flag — operators can set `disabled` to opt out. 
| + +No constitution violations. No complexity justification needed. + +## Project Structure + +### Documentation (this feature) + +```text +specs/003-mtls-transport-security/ +├── brainstorm.md # Design exploration and clarifications +├── spec.md # Feature specification +├── plan.md # This file +├── research.md # Phase 0 output +├── data-model.md # Phase 1 output +├── contracts/ # Phase 1 output +└── tasks.md # Phase 2 output (/speckit-tasks) +``` + +### Source Code (kagenti-operator) + +```text +kagenti-operator/ +├── api/v1alpha1/ +│ ├── agentruntime_types.go # MODIFY: mTLSMode default, MTLSReady condition type +│ └── zz_generated.deepcopy.go # REGENERATE: after type changes +├── internal/controller/ +│ ├── agentruntime_controller.go # MODIFY: set kagenti.io/mtls-mode annotation, MTLSReady condition, SpiffeFetcher default +│ ├── agentruntime_controller_test.go # MODIFY: tests for mTLS annotation, conditions, defaults +├── internal/webhook/injector/ +│ └── pod_mutator.go # MODIFY: read mTLSMode from AgentRuntime CR, set MTLS_MODE env var on authbridge +├── cmd/ +│ └── main.go # MODIFY: flag defaults, deprecation warnings +├── config/ +│ ├── crd/bases/ # REGENERATE: CRD manifests +│ └── rbac/ # VERIFY: no new RBAC needed +└── test/ + ├── e2e/ # MODIFY: mTLS e2e scenario + └── integration/ # MODIFY: mTLS integration test +``` + +### Source Code (kagenti-extensions — verification only) + +```text +kagenti-extensions/authbridge/ +├── cmd/ +│ ├── authbridge-envoy/main.go # VERIFY: mTLS wiring exists +│ ├── authbridge-proxy/main.go # VERIFIED: mTLS wiring complete +│ └── authbridge-lite/main.go # VERIFIED: mTLS wiring complete +├── authlib/ +│ ├── tls/ # VERIFIED: ServerConfig, ClientConfig +│ ├── spiffe/ # VERIFIED: X509Source, Provider +│ ├── config/ # VERIFY: MTLS config schema +│ └── listener/ +│ ├── reverseproxy/ # VERIFIED: MTLSOptions integration +│ └── forwardproxy/ # VERIFIED: MTLSOptions integration +└── tests/ # ADD: mTLS integration tests +``` + 
+**Structure Decision**: Existing kubebuilder project structure. All changes extend existing files. No new packages or directories needed in the operator. Authbridge changes are verification and testing only — implementation is already on main. + +## Complexity Tracking + +No constitution violations. No complexity justification needed. diff --git a/specs/003-mtls-transport-security/research.md b/specs/003-mtls-transport-security/research.md new file mode 100644 index 00000000..1204429b --- /dev/null +++ b/specs/003-mtls-transport-security/research.md @@ -0,0 +1,84 @@ +# Research: mTLS Transport Security for Agent Communication + +## R1: Authbridge mTLS Config Delivery (UPDATED) + +**Decision**: ~~The operator injects `mtls:` block into the authbridge ConfigMap.~~ **Superseded per PR #405 review.** The controller sets a `kagenti.io/mtls-mode` annotation on the pod template. The webhook reads `mTLSMode` from the AgentRuntime CR at pod CREATE time and sets `MTLS_MODE` env var on the authbridge container. + +**Annotation**: `kagenti.io/mtls-mode: permissive` (or `strict` or `disabled`) on pod template. + +**Env var**: `MTLS_MODE=permissive` (or `strict` or `disabled`) on authbridge container. + +**Resolved mode behavior** (unchanged): +- `permissive`: Inbound uses byte-peek TLS-sniffing (accepts both TLS and plaintext). Outbound uses plaintext. +- `strict`: Inbound rejects non-TLS connections. Outbound requires TLS. + +**Rationale**: ConfigMap injection was the original approach, but PR #405 removes CR-level fields from the config hash. Using an annotation on the pod template ensures mTLSMode changes trigger rolling restarts independently of the config hash. The webhook already reads AgentRuntime CRs at pod CREATE — adding the env var is a natural extension. + +**Alternatives considered**: +- ConfigMap injection (original approach) — rejected: ConfigMap is namespace-level, doesn't support per-AgentRuntime mTLSMode. 
+- Authbridge watches AgentRuntime CR directly — rejected: bad sidecar design (API server watches, RBAC blast radius, tight coupling). + +## R2: Authbridge SPIFFE Config Schema + +**Decision**: The authbridge sidecar reads SPIFFE configuration from a `spiffe:` block. This is already injected by the operator when spiffe-helper is present. + +**Schema**: +```yaml +spiffe: + socket: "unix:///spiffe-workload-api/spire-agent.sock" + mirrorFiles: true + mirrorDir: "/spiffe-certs" +``` + +**Rationale**: The SPIFFE block is a prerequisite for `mtls:`. The operator already handles this — no changes needed. + +## R3: Rolling Restart Mechanism (UPDATED) + +**Decision**: ~~mTLSMode flows through the config hash.~~ **Superseded per PR #405.** The controller sets a `kagenti.io/mtls-mode` annotation on the workload's pod template. When the annotation value changes (e.g., `permissive` → `strict`), Kubernetes detects a pod template change and triggers a rolling restart. This is independent of the platform config hash (`kagenti.io/config-hash`), which now only reflects cluster + namespace defaults. + +**Rationale**: PR #405 removes all CR-level fields from the config hash (2-layer merge: cluster + namespace only). A separate annotation preserves per-AgentRuntime mTLSMode restart semantics without conflating it with platform config. + +## R4: SpiffeFetcher Default Behavior + +**Decision**: When the controller pod has the SPIRE Workload API socket available (`--verified-fetch-spiffe-socket`), use `SpiffeFetcher` as the primary fetcher. Fall back to `DefaultFetcher` only when SPIRE is not configured. + +**Current behavior**: `SpiffeFetcher` is only used when `--enable-verified-fetch=true` (default: false). Changing the default to `true` and keeping the flag as a kill switch satisfies the "enabled by default, disabled is opt-in" requirement. + +**Rationale**: Minimal code change — just a default value flip. The flag remains for emergency rollback per Constitution V. 
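
The R4 decision above amounts to a two-input choice. A minimal sketch in Go, assuming a hypothetical `selectFetcher` helper and a simplified `fetcherKind` type (neither name comes from the operator code; only the flag semantics and the `SpiffeFetcher`/`DefaultFetcher` names are taken from this document):

```go
package main

import "fmt"

// fetcherKind names which card fetcher the controller would wire up.
// This type is illustrative only; the real operator wires concrete fetchers.
type fetcherKind string

const (
	fetcherSpiffe  fetcherKind = "SpiffeFetcher"  // mTLS-authenticated fetch
	fetcherDefault fetcherKind = "DefaultFetcher" // plain HTTP fetch
)

// selectFetcher is a hypothetical helper mirroring R4:
//   - enableVerifiedFetch models --enable-verified-fetch (kill switch, default true)
//   - spiffeSocket models --verified-fetch-spiffe-socket (empty means SPIRE is not configured)
func selectFetcher(enableVerifiedFetch bool, spiffeSocket string) fetcherKind {
	if enableVerifiedFetch && spiffeSocket != "" {
		return fetcherSpiffe
	}
	return fetcherDefault
}

func main() {
	socket := "unix:///spiffe-workload-api/spire-agent.sock"
	fmt.Println(selectFetcher(true, socket))  // SPIRE configured, flag on
	fmt.Println(selectFetcher(true, ""))      // SPIRE not configured
	fmt.Println(selectFetcher(false, socket)) // kill switch engaged
}
```

Keeping the flag as the outer guard means an emergency rollback needs only `--enable-verified-fetch=false`, regardless of whether the socket is mounted.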
+ +## R5: MTLSReady Condition Design + +**Decision**: Add a new condition type `MTLSReady` to AgentRuntime status conditions. + +**Condition states**: + +| Reason | Status | When | +|--------|--------|------| +| `SPIREAvailable` | True | SPIRE is deployed and certificates are available | +| `SPIREUnavailable` | False | mTLSMode is non-disabled but SPIRE is not deployed or unreachable | +| `MTLSDisabled` | True | mTLSMode is explicitly set to `disabled` (no mTLS expected) | + +**Detection**: The controller checks whether the spiffe-helper init container is present in the workload's pod template and whether the SPIRE agent socket volume mount exists. If mTLSMode is `permissive` or `strict` but these are absent, `MTLSReady=False`. + +**Rationale**: Follows the existing condition pattern (TargetResolved, ConfigResolved, Ready). Uses the same `metav1.Condition` type. + +## R6: Deprecation Warning Implementation + +**Decision**: At operator startup in `cmd/main.go`, after flag parsing, check if any legacy signing flags are set to `true` and log structured deprecation warnings. + +**Flags to deprecate**: +- `--require-a2a-signature` (default: false) +- `--signature-audit-mode` (default: false) +- `--enforce-network-policies` (default: false) + +**Warning format**: `slog.Warn("flag deprecated", "flag", name, "replacement", "mTLS transport security", "removal", "next release")` + +**Rationale**: Structured logging matches existing operator patterns. No runtime behavior change — just warnings. + +## R7: Authbridge-Envoy mTLS Status + +**Decision**: Verify whether `authbridge-envoy/main.go` has the same mTLS wiring as proxy and lite modes. + +**Finding**: Need to verify — the envoy mode uses `ext_proc`/`ext_authz` listeners instead of HTTP reverse/forward proxy. mTLS in envoy mode may be handled via Envoy's native `DownstreamTlsContext`/`UpstreamTlsContext` rather than the Go-level `authtls` package. 
The operator generates envoy bootstrap config, so TLS contexts need to be included there. + +**Action**: Verify during implementation. If envoy mode uses native Envoy TLS, the operator must inject TLS context config into the envoy bootstrap ConfigMap. diff --git a/specs/003-mtls-transport-security/spec.md b/specs/003-mtls-transport-security/spec.md new file mode 100644 index 00000000..d99d4456 --- /dev/null +++ b/specs/003-mtls-transport-security/spec.md @@ -0,0 +1,233 @@ +# Feature Specification: mTLS Transport Security for Agent Communication + +**Feature Branch**: `003-mtls-transport-security` +**Created**: 2026-06-03 +**Status**: Draft +**Input**: Brainstorm document `specs/003-mtls-transport-security/brainstorm.md` +**ADR**: ODH-ADR-AgentOps-0002 — Agent Network Policy and mTLS Identity +**Jira**: RHAIENG-4944 — Agent Discovery via mTLS +**Parent Feature**: RHAISTRAT-1599 — Productize & Downstream the Agent Operator +**Target Release**: rhoai-3.5.EA2 + +## Scope + +Enable mutual TLS (mTLS) as the default transport security layer for two communication paths in the Kagenti platform: + +1. **Controller-to-agent** — the operator controller fetching agent cards and verifying agent identity over mTLS +2. **Agent-to-agent** — inter-agent calls where both sides prove identity via SPIRE-issued X.509 certificates + +All status updates (transport security, attested SPIFFE ID, mTLS readiness conditions) are written to **AgentRuntime status**, not AgentCard. 
+ +### Out of Scope + +- Istio integration — no Istio dependency; SPIRE is the sole mTLS provider +- User-supplied certificates and cert-manager — deferred to a future iteration +- `spec.policy` enforcement (NetworkPolicy, AuthorizationPolicy) — separate spec +- AgentCard CRD removal — separate migration spec +- Cross-cluster agent federation — future work +- Bearer token / OAuth2 authorization — handled by authbridge, orthogonal to mTLS +- Downstreaming logistics for kagenti-extensions — separate spike +- Changes to spiffe-helper injection in the webhook — existing spiffe-helper injection is sufficient (the only webhook change in scope is setting `MTLS_MODE` on the authbridge container, see FR-003) + +## Clarifications + +### Session 2026-06-03 + +- Q: What sidecar mode should mTLS target? → A: All sidecar modes (envoy, proxy-sidecar, lite, waypoint). mTLS must be sidecar-agnostic. +- Q: What certificate sources should be supported? → A: SPIRE only. No Istio dependency. +- Q: How should the controller obtain its SPIFFE identity? → A: go-spiffe SDK directly in the controller binary (already implemented via SpiffeFetcher). +- Q: How should mTLS be enforced? → A: Enabled by default (permissive). Disabled is opt-in per-AgentRuntime. +- Q: How should certificates reach the data-plane sidecar? → A: spiffe-helper sidecar (file-based). Keeps proxy certificate-source-agnostic. +- Q: What happens when mTLS is enabled but SPIRE is not deployed? → A: Fail clearly. Error condition on AgentRuntime; operator must deploy SPIRE or set mTLSMode: disabled. 
+ +## Current State — What Already Exists + +### Authbridge (kagenti-extensions) — DONE + +mTLS is fully implemented in authbridge across all three proxy modes on main: + +- `authlib/tls/server.go` — `ServerConfig()` for mTLS reverse-proxy listener (inbound) +- `authlib/tls/client.go` — `ClientConfig()` for mTLS forward-proxy dialer (outbound) +- `authlib/spiffe/source.go` — `X509Source` interface abstracting SPIFFE credentials +- `authlib/spiffe/provider.go` — File-based provider reading from spiffe-helper PEM files +- `authbridge-proxy/main.go` — Full mTLS wiring: `cfg.MTLS` → `MTLSOptions` → listeners +- `authbridge-lite/main.go` — Same mTLS wiring as proxy (size-optimized variant) +- Permissive mode: inbound byte-peek TLS-sniffing, outbound plaintext +- Strict mode: inbound rejects non-TLS, outbound TLS-or-fail +- Per-handshake certificate rotation from spiffe-helper files +- `authtls.Metrics` for TLS handshake observability + +### Operator (kagenti-operator) — Partially Done + +- `MTLSMode` field on `AgentRuntimeSpec`: enum with values `disabled`, `permissive`, `strict` (defaults to `disabled`) +- `CardStatus` struct on `AgentRuntimeStatus`: includes `TransportSecurity`, `AttestedAgentSpiffeID`, `ValidSignature` +- `SpiffeFetcher` in `internal/agentcard/fetcher.go`: mTLS-authenticated fetch using go-spiffe X509Source +- `--enable-verified-fetch` flag gates mTLS fetch (default: `false`) +- `--enable-card-discovery` flag gates card discovery into `status.card` +- Webhook injector auto-injects spiffe-helper when `mTLSMode` is non-disabled + +## User Scenarios & Testing + +### User Story 1 — Agent-to-Agent mTLS by Default (Priority: P1) + +A platform operator deploys two agents. Without any mTLS-specific configuration, both agents communicate over mTLS automatically because SPIRE is deployed and mTLS defaults to permissive. 
+ +**Why this priority**: This is the core value — mTLS should just work without manual configuration when the infrastructure (SPIRE) is present. + +**Independent Test**: Deploy two agent workloads with SPIRE, create AgentRuntimes (no explicit `mTLSMode`), and verify inter-agent calls use mTLS by checking authbridge logs for TLS handshake entries. + +**Acceptance Scenarios**: + +1. **Given** two AgentRuntimes with no explicit `mTLSMode` set and SPIRE deployed in the cluster, **When** agent A calls agent B, **Then** the authbridge sidecar establishes an mTLS connection using SPIRE-issued SVIDs and both agents' SPIFFE IDs are verified against the trust bundle. +2. **Given** an AgentRuntime with `mTLSMode: disabled`, **When** agent A calls this agent, **Then** the authbridge sidecar accepts plaintext HTTP connections (no TLS required). +3. **Given** an AgentRuntime with `mTLSMode: strict`, **When** a plaintext HTTP request arrives, **Then** the authbridge sidecar rejects the connection. + +--- + +### User Story 2 — Controller-to-Agent Communication over mTLS (Priority: P1) + +The operator controller communicates with agent workloads over mTLS by default when SPIRE is available. When fetching the A2A agent card from the live agent endpoint (`/.well-known/agent-card.json`), the controller uses mTLS to verify the agent's identity. The verified SPIFFE identity and transport security metadata are recorded in `AgentRuntime.status.card`. The AgentCard CRD is not involved — the controller talks directly to the agent workload and writes results to AgentRuntime status. + +**Why this priority**: Controller-to-agent identity verification is a core security requirement. The SpiffeFetcher already exists; this wires it as the default. + +**Independent Test**: Deploy an agent with SPIRE identity, create an AgentRuntime, and verify `status.card.transportSecurity` is `mtls` and `status.card.attestedAgentSpiffeID` contains the agent's SPIFFE ID. + +**Acceptance Scenarios**: + +1. 
**Given** an AgentRuntime targeting a workload with SPIRE identity and the controller has access to the SPIRE Workload API, **When** the controller reconciles the AgentRuntime, **Then** the A2A agent card is fetched from the live endpoint over mTLS and `AgentRuntime.status.card.transportSecurity` is `mtls`. +2. **Given** an AgentRuntime targeting a workload with SPIRE identity, **When** the controller fetches the A2A agent card over mTLS, **Then** `AgentRuntime.status.card.attestedAgentSpiffeID` contains the SPIFFE ID extracted from the peer certificate. +3. **Given** a controller without SPIRE configured (no Workload API socket), **When** the controller reconciles an AgentRuntime, **Then** the A2A agent card is fetched over plain HTTP and `AgentRuntime.status.card.transportSecurity` is `http`. + +--- + +### User Story 3 — Clear Error When SPIRE Is Unavailable (Priority: P1) + +An operator deploying agents without SPIRE but with mTLS defaulting to enabled gets a clear error condition explaining what's wrong and how to fix it. + +**Why this priority**: Fail-clearly prevents silent security gaps. Operators must know when mTLS isn't active. + +**Independent Test**: Create an AgentRuntime in a cluster without SPIRE and verify the `MTLSReady` condition is `False` with reason `SPIREUnavailable`. + +**Acceptance Scenarios**: + +1. **Given** an AgentRuntime with `mTLSMode: permissive` (the default) and no SPIRE deployed, **When** the controller reconciles, **Then** `status.conditions` includes `MTLSReady=False` with reason `SPIREUnavailable` and message `"mTLS requires SPIRE; either deploy SPIRE or set mTLSMode: disabled"`. +2. **Given** an AgentRuntime with `mTLSMode: disabled`, **When** the controller reconciles in a cluster without SPIRE, **Then** `MTLSReady=True` with reason `MTLSDisabled` (mTLS is explicitly opted out — no error). +3. 
**Given** an AgentRuntime with `MTLSReady=False` and SPIRE is subsequently deployed, **When** the controller reconciles, **Then** `MTLSReady` transitions to `True`. + +--- + +### User Story 4 — JWS Signing Pipeline Deprecation Warning (Priority: P2) + +Operators using the legacy JWS signing pipeline receive deprecation warnings directing them to mTLS as the replacement. + +**Why this priority**: Signals the migration path without breaking existing setups. + +**Independent Test**: Start the operator with `--require-a2a-signature=true` and verify deprecation warnings in the logs. + +**Acceptance Scenarios**: + +1. **Given** the operator is started with `--require-a2a-signature=true`, **When** the operator boots, **Then** a deprecation warning is logged: `"--require-a2a-signature is deprecated; mTLS replaces JWS card signing. This flag will be removed in a future release."`. +2. **Given** the operator is started with `--signature-audit-mode=true`, **When** the operator boots, **Then** a similar deprecation warning is logged. +3. **Given** the operator is started with none of the legacy signing flags, **When** the operator boots, **Then** no deprecation warnings are logged. + +--- + +### Edge Cases + +- What happens when the spiffe-helper is slow to write initial SVIDs? The authbridge proxy blocks at startup until certificates are available (`WaitForCredentialFile` in `authlib/config/resolve.go`). The kubelet restart policy handles prolonged failures. +- What happens when an agent with `mTLSMode: strict` tries to call an agent with `mTLSMode: disabled`? The outbound mTLS handshake fails because the target doesn't present a TLS listener. The authbridge logs the handshake failure. This is expected — strict callers cannot reach non-TLS targets. +- What happens when `mTLSMode` changes on a running AgentRuntime? The controller updates the `kagenti.io/mtls-mode` annotation on the pod template, which triggers a rolling restart of the workload (independent of the platform config hash, see FR-009). The new sidecar boots with the updated `MTLS_MODE` value. +- What happens during SVID rotation? 
The authbridge's per-handshake callbacks re-read certificates from disk on every TLS handshake. Rotation is transparent — no restart, no connection drop. +- What happens in a mixed-mode deployment (some agents permissive, some strict)? Permissive agents accept both TLS and plaintext inbound. Strict agents reject plaintext. Permissive outbound is plaintext — a permissive caller cannot reach a strict target. For full mesh mTLS, all agents should be strict. + +### Coexistence with Istio mTLS + +SPIRE-based mTLS (this spec) and Istio-based mTLS (PR #383 SharedTrust controller, Issue #399 Istio auto-labeling) operate at different layers and are complementary, not competing: + +- **SPIRE mTLS** (this spec): Application-layer identity via authbridge sidecars. SPIRE issues X.509 SVIDs that prove agent identity. The authbridge proxy terminates/originates TLS using these certificates. +- **Istio mTLS** (#383, #399): Infrastructure-layer encryption via ztunnel (ambient mode) or Envoy sidecar (sidecar mode). Provides pod-to-pod encryption transparently at L4. + +When both are active on the same workload, traffic is double-encrypted (Istio at L4, then SPIRE at L7 inside the authbridge). This is functionally correct but wastes resources. For this iteration, both can coexist without conflict because they operate on different ports/layers. A future optimization could detect Istio enrollment and skip authbridge TLS when Istio ambient mode covers the same path. + +The operator does not need to detect or interact with Istio in this spec. If Istio is present, it adds a transparent encryption layer underneath; if absent, authbridge handles mTLS on its own. + +### MTLSReady Condition and Ready Condition Interaction + +`MTLSReady=False` does NOT block `Ready=True`. The `Ready` condition reflects whether the workload is configured and running — not whether mTLS is active. Blocking Ready would break existing deployments without SPIRE during the transition period. 
+ +When `MTLSReady=False`, the operator: +- Sets `Ready=True` (workload is functional) +- Sets `MTLSReady=False/SPIREUnavailable` with actionable message +- Emits a Kubernetes Event (type Warning) so `kubectl describe agentruntime` shows the issue + +Operators monitor `MTLSReady` to track mTLS rollout progress across the fleet. + +### SPIRE Detection Heuristic + +The controller detects SPIRE availability by checking for the spiffe-helper init container or the SPIRE agent socket volume mount in the workload's pod template. This covers the standard deployment pattern where the webhook injects spiffe-helper. + +**Known limitation**: SPIRE CSI driver deployments use a `csi` volume type instead of the spiffe-helper init container. The detection heuristic should also check for CSI volumes with driver `csi.spiffe.io`. This is a follow-up enhancement — for the initial implementation, spiffe-helper is the supported pattern. + +## Requirements + +### Functional Requirements + +- **FR-001**: The system MUST default `mTLSMode` to `permissive` when not explicitly set on an AgentRuntime CR. +- **FR-002**: The controller MUST set a `kagenti.io/mtls-mode` annotation on the workload's pod template with the resolved `mTLSMode` value (`permissive`, `strict`, or `disabled`). The webhook reads this annotation (and/or the AgentRuntime CR directly) at pod CREATE time to configure the authbridge sidecar. +- **FR-003**: The webhook MUST set the `MTLS_MODE` environment variable on the authbridge sidecar container based on the AgentRuntime's `mTLSMode` value. Authbridge reads this env var at startup to configure its TLS listeners. +- **FR-004**: The system MUST set an `MTLSReady` condition on AgentRuntime status indicating whether mTLS infrastructure (SPIRE) is available. +- **FR-005**: The system MUST use `SpiffeFetcher` (mTLS) as the default card fetcher when the controller has access to the SPIRE Workload API socket. 
+- **FR-006**: The system MUST fall back to `DefaultFetcher` (plain HTTP) when SPIRE is not configured on the controller pod. +- **FR-007**: The system MUST record `status.card.transportSecurity` as `mtls` or `http` on AgentRuntime to indicate which transport was used for the card fetch. +- **FR-008**: The system MUST record `status.card.attestedAgentSpiffeID` on AgentRuntime with the SPIFFE ID extracted from the peer certificate when the card is fetched over mTLS. +- **FR-009**: The system MUST trigger a rolling restart of the workload when `mTLSMode` changes. The controller sets a `kagenti.io/mtls-mode` annotation on the pod template; when this annotation value changes, Kubernetes triggers a rolling restart. This is independent of the platform config hash. +- **FR-010**: The system MUST log deprecation warnings at operator startup when legacy JWS signing flags (`--require-a2a-signature`, `--signature-audit-mode`, `--enforce-network-policies`) are set to `true`. +- **FR-011**: The system MUST default legacy JWS signing flags to `false`. +- **FR-012**: The system MUST change the `--enable-verified-fetch` flag default to `true` (kill switch retained for one release cycle). + +### Key Entities + +- **AgentRuntime**: Existing CRD extended with mTLS defaults. `spec.mTLSMode` controls transport security. `status.card` holds transport security metadata. `status.conditions` includes `MTLSReady`. All mTLS-related status goes here, NOT on AgentCard. +- **Authbridge Sidecar**: Per-pod sidecar container configured by the webhook at pod CREATE time. Reads `MTLS_MODE` env var to configure TLS listeners. +- **SPIRE**: External dependency providing X.509 SVIDs via the Workload API. Must be deployed for mTLS to function. +- **spiffe-helper**: Sidecar container that fetches SVIDs from SPIRE and writes PEM files to a shared volume. The authbridge proxy reads these files. 
+- **SpiffeFetcher**: Existing component in `internal/agentcard/fetcher.go` that performs mTLS-authenticated card fetches using go-spiffe X509Source directly. + +## Success Criteria + +### Measurable Outcomes + +- **SC-001**: Agents deployed with SPIRE communicate over mTLS by default without explicit `mTLSMode` configuration. +- **SC-002**: The controller fetches agent cards over mTLS by default when SPIRE is available, with the transport security and attested SPIFFE ID visible in `AgentRuntime.status.card`. +- **SC-003**: Operators without SPIRE see a clear `MTLSReady=False` condition with actionable guidance. +- **SC-004**: Existing deployments using JWS signing see deprecation warnings and continue functioning during the transition period. +- **SC-005**: mTLS works across all authbridge sidecar modes (envoy, proxy-sidecar, lite). + +## Assumptions + +- SPIRE is deployed in the cluster and the SPIRE agent socket is accessible to both the controller pod and agent workload pods. +- The authbridge mTLS implementation in kagenti-extensions (on main) is stable and tested. +- Each agent workload has exactly one Pod (one-agent-per-pod model), so pod identity equals agent identity (SPIFFE ID). +- The spiffe-helper sidecar is already injected by the webhook when `mTLSMode` is non-disabled — no changes to injection logic are needed. +- The authbridge sidecar supports an `MTLS_MODE` environment variable to configure TLS listener mode at startup. 
+ +## Repositories Affected + +### kagenti-operator (primary) + +| File | Change | +|------|--------| +| `api/v1alpha1/agentruntime_types.go` | Change `mTLSMode` default to `permissive`; add `MTLSReady` condition type | +| `internal/controller/agentruntime_controller.go` | Set `kagenti.io/mtls-mode` annotation on pod template; set `MTLSReady` condition; use SpiffeFetcher by default | +| `internal/controller/agentruntime_controller_test.go` | Tests for mTLS annotation, default mTLS, MTLSReady condition | +| `internal/webhook/injector/pod_mutator.go` | Read `mTLSMode` from AgentRuntime CR; set `MTLS_MODE` env var on authbridge container | +| `cmd/main.go` | Default `--enable-verified-fetch` to `true`; default signing flags to `false`; add deprecation log warnings | +| `config/crd/bases/` | Regenerate CRD manifests if type changes | + +### kagenti-extensions (verification + testing) + +| File | Change | +|------|--------| +| `authbridge/cmd/authbridge-envoy/main.go` | Verify mTLS wiring exists (same as proxy/lite) | +| `authbridge/authlib/listener/` | Integration tests for mTLS handshake across modes | +| `authbridge/authlib/tls/` | Verify permissive/strict behavior matches operator expectations | diff --git a/specs/003-mtls-transport-security/tasks.md b/specs/003-mtls-transport-security/tasks.md new file mode 100644 index 00000000..320e08cf --- /dev/null +++ b/specs/003-mtls-transport-security/tasks.md @@ -0,0 +1,217 @@ +# Tasks: mTLS Transport Security for Agent Communication + +**Input**: Design documents from `specs/003-mtls-transport-security/` + +**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/ + +**Tests**: Test tasks are included since this is a controller change requiring unit and integration test coverage per the constitution (Principle II: Spec-Anchored Testing). + +**Organization**: Tasks are grouped by user story to enable independent implementation and testing of each story. 
+ +**Codebase Audit (2026-06-05)**: Reconciled tasks against current main. Several items are partially or fully implemented. Tasks marked `[DONE]` reflect existing code; tasks marked `[PARTIAL]` need only the remaining delta. + +## Format: `[ID] [P?] [Story] Description` + +- **[P]**: Can run in parallel (different files, no dependencies) +- **[Story]**: Which user story this task belongs to (e.g., US1, US2, US3) +- Include exact file paths in descriptions + +## File Paths + +All paths are relative to `kagenti-operator/` (the Go module root inside the repo): + +- `api/v1alpha1/agentruntime_types.go` — CRD types +- `internal/controller/agentruntime_controller.go` — main reconciler (has `fetchCard()`, conditions, etc.) +- `internal/controller/agentruntime_config.go` — config merge and hash (has `resolvedConfig` with `MTLSMode`) +- `internal/controller/agentruntime_controller_test.go` — controller tests +- `internal/webhook/injector/` — sidecar injection (envoy template, resolved config) +- `cmd/main.go` — flag definitions and wiring + +## What Already Exists on Main + +The following are already implemented and do NOT need new code: + +- `MTLSMode` field on `AgentRuntimeSpec` with values `disabled`, `permissive`, `strict` (defaults to empty/disabled) +- `TransportSecurity` enum (`mtls`, `http`) and `CardStatus` struct on status +- `SpiffeFetcher` / `AuthenticatedFetcher` wired into `fetchCard()` — already chooses mTLS vs HTTP +- Webhook resolves `MTLSMode` from CR > namespace > default chain and configures authbridge sidecar +- `--require-a2a-signature`, `--signature-audit-mode`, `--enforce-network-policies` already default to `false` +- Envoy template has `MTLSEnabled`, `MTLSPermissive`, `MTLSStrict` wiring for TLS contexts +- `authbridge-runtime-config` ConfigMap content is captured in config hash + +--- + +## Phase 1: Setup + +**Purpose**: CRD type changes, flag defaults, and code generation + +- [ 
] T001 Change `mTLSMode` default to `permissive` in `api/v1alpha1/agentruntime_types.go`. Currently the field has no kubebuilder default marker and empty string is treated as disabled. Add `// +kubebuilder:default=permissive` marker on the `MTLSMode` field (line 124). Add `ConditionTypeMTLSReady = "MTLSReady"` constant alongside the existing condition types (line 71-75). +- [ ] T002 Run `make generate && make manifests` to regenerate deepcopy and CRD manifests. Verify the CRD schema shows `default: permissive` for mtlsMode. +- [ ] T003 [P] Change `--enable-verified-fetch` flag default from `false` to `true` in `cmd/main.go` (line 164). Change `--enable-card-discovery` flag default from `false` to `true` in `cmd/main.go` (line 162). Both are needed — verified fetch without card discovery does nothing. +- [ ] T004 [P] Add deprecation warning logs in `cmd/main.go` after `flag.Parse()`. When `--require-a2a-signature`, `--signature-audit-mode`, or `--enforce-network-policies` is explicitly `true`, log: `setupLog.Info("DEPRECATED: flag will be removed in a future release; mTLS transport security replaces JWS signing", "flag", "")`. The defaults are already `false` — no change needed there. + +**Checkpoint**: CRD default updated, flags flipped, code regenerated. + +--- + +## Phase 2: Foundational (Blocking Prerequisites) + +**Purpose**: MTLSReady condition logic and mTLS annotation on pod template + +**CRITICAL**: No user story work can begin until this phase is complete + +- [ ] T005 [SUPERSEDED → REPLACED] Add `kagenti.io/mtls-mode` annotation to the workload's pod template in `internal/controller/agentruntime_controller.go`. In `applyWorkloadConfig()`, set the annotation value to the resolved `mTLSMode` (`permissive`, `strict`, or `disabled`). When `mTLSMode` changes on the AgentRuntime CR, the annotation value changes, causing Kubernetes to trigger a rolling restart. 
This replaces the previous ConfigMap-based approach — the webhook reads `mTLSMode` from the AgentRuntime CR at pod CREATE time and sets `MTLS_MODE` env var on the authbridge container. Update the webhook in `internal/webhook/injector/pod_mutator.go` to set this env var. +- [ ] T006 Add `MTLSReady` condition logic to the reconcile loop in `internal/controller/agentruntime_controller.go`. Insert after target resolution (around line 170) and before the Ready condition (line 251). Logic: if `mTLSMode` resolves to `disabled` or empty-before-default → `MTLSReady=True/MTLSDisabled`; if `permissive` or `strict` → check whether the workload's pod template has spiffe-helper volume mounts or the SPIRE agent socket mount → if present `MTLSReady=True/SPIREAvailable`, if absent `MTLSReady=False/SPIREUnavailable` with message `"mTLS requires SPIRE; either deploy SPIRE or set mTLSMode: disabled"`. Use `r.setCondition()` (existing helper). Follow save/restore pattern around Patch calls (Constitution I). + +**Checkpoint**: MTLSReady condition and mTLS annotation ready. + +--- + +## Phase 3: User Story 1 — Agent-to-Agent mTLS by Default (Priority: P1) MVP + +**Goal**: Agents deployed with SPIRE communicate over mTLS automatically without explicit mTLSMode configuration because mTLSMode defaults to permissive. + +**Independent Test**: Deploy two agent workloads with SPIRE, create AgentRuntimes with no explicit mTLSMode, verify the pod template has `kagenti.io/mtls-mode: permissive` annotation, authbridge sidecar has `MTLS_MODE=permissive` env var, and `MTLSReady=True`. + +### Tests for User Story 1 + +- [ ] T007 [P] [US1] Add unit tests for `kagenti.io/mtls-mode` annotation in `internal/controller/agentruntime_controller_test.go`. Test cases: (a) mTLSMode unset (defaults to permissive via kubebuilder marker) → pod template annotation is `kagenti.io/mtls-mode: permissive`; (b) mTLSMode `strict` → annotation is `strict`; (c) mTLSMode `disabled` → annotation is `disabled`. 
Also test webhook: verify `MTLS_MODE` env var is set on authbridge container. Create objects in envtest and read back from API server (Constitution II). +- [ ] T008 [P] [US1] Add unit tests for `MTLSReady` condition in `internal/controller/agentruntime_controller_test.go`. Test cases: (a) mTLSMode permissive + spiffe-helper present → `MTLSReady=True/SPIREAvailable`; (b) mTLSMode permissive + no spiffe-helper → `MTLSReady=False/SPIREUnavailable`; (c) mTLSMode disabled → `MTLSReady=True/MTLSDisabled`. Read back AgentRuntime from envtest API server (Constitution II). +- [ ] T009 [P] [US1] Add unit test for annotation change on mTLSMode transition in `internal/controller/agentruntime_controller_test.go`. Verify that changing mTLSMode from `permissive` to `strict` or `disabled` results in a different `kagenti.io/mtls-mode` annotation on the workload pod template, triggering a rolling restart. Note: mTLSMode is NOT in the config hash (per PR #405) — the annotation is the restart mechanism. + +### Implementation for User Story 1 + +- [ ] T010 [US1] Wire T005 and T006 into the reconcile flow end-to-end. Create an AgentRuntime with no explicit mTLSMode, reconcile, verify: (a) the resolved mTLSMode is `permissive`; (b) the pod template has `kagenti.io/mtls-mode: permissive` annotation; (c) `MTLSReady` condition is set; (d) webhook sets `MTLS_MODE` env var on authbridge container. +- [ ] T011 [US1] Verify that changing `mTLSMode` on an existing AgentRuntime triggers a workload rolling restart via the `kagenti.io/mtls-mode` annotation change on the pod template (NOT via config hash — mTLSMode is excluded from the hash per PR #405). + +**Checkpoint**: Agent-to-agent mTLS defaults to permissive. Annotation and conditions correct. + +--- + +## Phase 4: User Story 2 — Controller-to-Agent Communication over mTLS (Priority: P1) + +**Goal**: The operator controller uses SpiffeFetcher by default when SPIRE is available. 
Transport security metadata is recorded in AgentRuntime.status.card. + +**Independent Test**: Deploy an agent with SPIRE, create an AgentRuntime, verify `status.card.transportSecurity` is `mtls`. + +**Note**: The `fetchCard()` method (line 863) already implements the mTLS-first, HTTP-fallback logic. The `AuthenticatedFetcher` is already wired. The main change is making it the default via T003 (`--enable-verified-fetch=true`). + +### Tests for User Story 2 + +- [ ] T012 [P] [US2] Add unit tests in `internal/controller/agentruntime_controller_test.go` verifying: (a) when `AuthenticatedFetcher` is set and TLS port exists → `status.card.transportSecurity` is `mtls` and `attestedAgentSpiffeID` is populated; (b) when `AuthenticatedFetcher` is nil → `status.card.transportSecurity` is `http`; (c) when TLS port is missing, falls back to HTTP with a `FallbackToHTTP` event. Use stub fetchers. Read back from envtest (Constitution II). +- [ ] T013 [P] [US2] Add unit test verifying `--enable-verified-fetch=false` (kill switch) results in `AuthenticatedFetcher` being nil even when SPIRE is configured. Test at the `cmd/main.go` wiring level or at the reconciler level with `EnableVerifiedFetch=false`. + +### Implementation for User Story 2 + +- [x] T014 [DONE] [US2] The `fetchCard()` method already wires SpiffeFetcher as the preferred fetcher, falls back to HTTP, and populates `transportSecurity` and `attestedAgentSpiffeID`. No new code needed — T003 (flag default change) activates this path. + +**Checkpoint**: Controller fetches over mTLS by default. Status metadata populated. + +--- + +## Phase 5: User Story 3 — Clear Error When SPIRE Is Unavailable (Priority: P1) + +**Goal**: Operators without SPIRE see a clear MTLSReady=False condition with actionable guidance. + +**Independent Test**: Create an AgentRuntime in a cluster without SPIRE, verify MTLSReady=False/SPIREUnavailable. 
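The decision that T006 describes, and that this story verifies, can be sketched in Go as follows. This is a minimal sketch under stated assumptions, not the operator's actual code: `Condition` is a simplified stand-in for `metav1.Condition`, `resolveMTLSReady` and its `spirePresent` input are hypothetical names, and the `UnknownMode` branch is an assumption because the spec only covers `disabled`, `permissive`, and `strict`.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

const conditionTypeMTLSReady = "MTLSReady"

// resolveMTLSReady maps the resolved mTLS mode and SPIRE availability to an
// MTLSReady condition. spirePresent is assumed to be computed elsewhere, for
// example by inspecting the pod template for spiffe-helper volume mounts.
func resolveMTLSReady(mode string, spirePresent bool) Condition {
	cond := Condition{Type: conditionTypeMTLSReady}
	switch mode {
	case "", "disabled":
		// mTLS is off, so there is nothing that can be unavailable.
		cond.Status = "True"
		cond.Reason = "MTLSDisabled"
	case "permissive", "strict":
		if spirePresent {
			cond.Status = "True"
			cond.Reason = "SPIREAvailable"
		} else {
			cond.Status = "False"
			cond.Reason = "SPIREUnavailable"
			cond.Message = "mTLS requires SPIRE; either deploy SPIRE or set mTLSMode: disabled"
		}
	default:
		// Assumption: the spec does not define behavior for other values.
		cond.Status = "False"
		cond.Reason = "UnknownMode"
	}
	return cond
}

func main() {
	for _, mode := range []string{"disabled", "permissive", "strict"} {
		for _, spire := range []bool{true, false} {
			c := resolveMTLSReady(mode, spire)
			fmt.Printf("mode=%s spire=%t -> %s=%s/%s\n", mode, spire, c.Type, c.Status, c.Reason)
		}
	}
}
```

Keeping the decision in a pure function like this would let the three cases be tested as a table without envtest, while the reconciler stays responsible for detecting SPIRE and writing the condition.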
+ +### Tests for User Story 3 + +- [ ] T015 [P] [US3] Add unit test in `internal/controller/agentruntime_controller_test.go` for the SPIRE-unavailable case: (a) workload with no spiffe-helper volume → `MTLSReady=False/SPIREUnavailable` with message containing `"mTLS requires SPIRE"`; (b) verify the `Ready` condition is not blocked by the MTLSReady failure and that a Warning Event is emitted (see T016). Read back from envtest (Constitution II). + +### Implementation for User Story 3 + +- [ ] T016 [US3] Verify T006's MTLSReady condition works end-to-end. `MTLSReady=False` does NOT block `Ready=True` — the workload is still functional without mTLS. Instead, emit a Kubernetes Warning Event on the AgentRuntime so `kubectl describe` surfaces the issue. Also expand SPIRE detection to check for CSI driver volumes (`csi.spiffe.io`) alongside spiffe-helper. + +**Checkpoint**: SPIRE-unavailable produces clear, actionable conditions. + +--- + +## Phase 6: User Story 4 — JWS Signing Pipeline Deprecation Warning (Priority: P2) + +**Goal**: Operators using legacy signing flags see deprecation warnings. + +**Independent Test**: Start operator with `--require-a2a-signature=true`, verify deprecation log. + +### Tests for User Story 4 + +- [ ] T017 [P] [US4] Add test verifying deprecation warning is logged when legacy flags are set to `true`. This can be a simple test that parses log output or validates the warning logic function. + +### Implementation for User Story 4 + +- [ ] T018 [US4] Verify T004's deprecation warnings work. Ensure all three flags emit warnings. Note: the flags already default to `false` on main — only the warning message is new. + +**Checkpoint**: Deprecation warnings active. + +--- + +## Phase 7: Authbridge Verification (kagenti-extensions) + +**Purpose**: Verify authbridge mTLS is complete and matches operator expectations + +- [x] T019 [DONE] Envoy mTLS wiring confirmed — webhook injector has `MTLSEnabled`, `MTLSPermissive`, `MTLSStrict` template fields driving envoy TLS contexts.
+- [ ] T020 [P] Verify the `cfg.MTLS` config schema in `authbridge/authlib/config/config.go` matches what the operator generates. Specifically: does the authbridge expect `mtls:\n mode: permissive` or a different shape? Clone `kagenti-extensions` and check. +- [ ] T021 Add mTLS integration tests (if not already present in authbridge). Test: (a) permissive accepts TLS + plaintext; (b) strict rejects plaintext; (c) cert rotation. + +**Checkpoint**: Authbridge config contract verified. + +--- + +## Phase 8: Polish & Cross-Cutting Concerns + +- [ ] T022 [P] Add e2e test in `test/e2e/` — deploy agents with SPIRE, create AgentRuntimes with default mTLSMode, verify the `kagenti.io/mtls-mode` pod template annotation, the `MTLS_MODE` env var on the authbridge container, conditions, and card fetch transport. +- [ ] T023 [P] Update documentation (`GETTING_STARTED.md`) for mTLS-by-default behavior and opt-out via `mTLSMode: disabled`. +- [ ] T024 Run `make generate && make manifests && make test` — verify no regressions. + +--- + +## Summary of Actual Work Needed + +| Task | Status | Work Required | +|------|--------|---------------| +| T001 | NEW | Add kubebuilder default marker + MTLSReady condition constant | +| T002 | NEW | make generate && make manifests | +| T003 | NEW | Flip two flag defaults to true | +| T004 | NEW | Add deprecation warning logs (flag defaults already false) | +| T005 | REPLACED | Set `kagenti.io/mtls-mode` annotation on pod template + webhook sets `MTLS_MODE` env var | +| T006 | NEW | MTLSReady condition logic in reconcile loop | +| T007-T009 | NEW | Unit tests for annotation, condition, annotation change | +| T010-T011 | NEW | End-to-end wiring verification | +| T012-T013 | NEW | Unit tests for SpiffeFetcher default | +| T014 | DONE | fetchCard() already implements mTLS-first logic | +| T015 | NEW | Unit test for SPIRE unavailable | +| T016 | NEW | MTLSReady condition + Warning Event (does not block Ready) | +| T017-T018 | NEW | Deprecation warning test + verification | +| T019 | DONE | Envoy mTLS confirmed in webhook injector | +| T020 | NEW
| Verify authbridge config schema match | +| T021-T024 | NEW | Integration tests, e2e, docs, regression check | + +**Net new implementation tasks**: 6 (T001, T003, T004, T005, T006, T016) +**Net new test tasks**: 8 (T007-T009, T012-T013, T015, T017, T022) +**Already done**: 2 (T014, T019) +**Verification/polish**: 8 (T002, T010-T011, T018, T020-T021, T023-T024) + +--- + +## Dependencies & Execution Order + +### Phase Dependencies + +- **Setup (Phase 1)**: No dependencies — start immediately +- **Foundational (Phase 2)**: Depends on T001, T002 +- **User Stories (Phase 3-6)**: All depend on Phase 2, with one exception + - US4 can start after T004 (independent of Phase 2) +- **Authbridge Verification (Phase 7)**: Independent — run in parallel +- **Polish (Phase 8)**: After all user stories + +### Recommended Execution Order + +1. T001 → T002 (CRD changes must come first) +2. T003 + T004 in parallel (flag changes) +3. T005 + T006 (foundational: pod template annotation + condition) +4. T007-T011 (US1 tests + wiring) +5. T012-T014 (US2 tests — T014 is already done, just verify) +6. T015-T016 (US3 tests + Warning Event; MTLSReady does not gate Ready) +7. T017-T018 (US4 deprecation) +8. T020-T024 (verification, e2e, docs)
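Step 3 of the order above rests on the T005 annotation mechanism: a changed `kagenti.io/mtls-mode` value on the pod template is what triggers the rolling restart. A minimal Go sketch of that step is shown below. It uses a plain `map[string]string` in place of the real pod template metadata, and `setMTLSModeAnnotation` is a hypothetical helper name, not a function known to exist in `applyWorkloadConfig()`.

```go
package main

import "fmt"

// mtlsModeAnnotation is the annotation key named in T005.
const mtlsModeAnnotation = "kagenti.io/mtls-mode"

// setMTLSModeAnnotation (hypothetical helper) writes the resolved mode into the
// pod template annotations and reports whether the value changed. A change to
// a pod template annotation is what makes Kubernetes roll the workload.
// The input map is mutated in place; a nil map is replaced by a new one.
func setMTLSModeAnnotation(annotations map[string]string, mode string) (map[string]string, bool) {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if current, ok := annotations[mtlsModeAnnotation]; ok && current == mode {
		return annotations, false // same value: no restart expected
	}
	annotations[mtlsModeAnnotation] = mode
	return annotations, true
}

func main() {
	// First reconcile: the annotation is absent, so it is added.
	ann, changed := setMTLSModeAnnotation(nil, "permissive")
	fmt.Println(ann[mtlsModeAnnotation], changed)

	// Same mode again: nothing changes.
	ann, changed = setMTLSModeAnnotation(ann, "permissive")
	fmt.Println(ann[mtlsModeAnnotation], changed)

	// Mode transition: the annotation value changes.
	ann, changed = setMTLSModeAnnotation(ann, "strict")
	fmt.Println(ann[mtlsModeAnnotation], changed)
}
```

Returning the `changed` flag is a design choice for the sketch only; it makes the T009 transition test straightforward to express, whereas the real controller may simply patch the workload and let the API server detect the difference.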