Feat: [spec] mTLS transport security for agent communication (003) #401

Merged
pdettori merged 5 commits into kagenti:main from varshaprasad96:003-mtls-transport-security
Jun 10, 2026
Conversation

@varshaprasad96

@varshaprasad96 varshaprasad96 commented Jun 3, 2026

Contributor

Summary

  • Adds spec-driven development artifacts for mTLS transport security (spec 003)
  • Covers two communication paths: controller-to-agent and agent-to-agent mTLS
  • SPIRE is the sole certificate provider; Istio is explicitly out of scope
  • mTLS defaults to permissive (enabled by default, opt-out with mTLSMode: disabled)
  • Authbridge mTLS is already implemented — remaining work is primarily operator-side

Artifacts

| File | Purpose |
| --- | --- |
| specs/003-mtls-transport-security/spec.md | Feature specification with user stories, requirements, acceptance scenarios |
| specs/003-mtls-transport-security/plan.md | Implementation plan with constitution check and project structure |
| specs/003-mtls-transport-security/research.md | Research decisions (config schema, SpiffeFetcher, MTLSReady condition) |
| specs/003-mtls-transport-security/data-model.md | Entity changes, condition states, state transitions |
| specs/003-mtls-transport-security/tasks.md | 24 implementation tasks across 8 phases |
| specs/003-mtls-transport-security/contracts/ | Operator-to-authbridge mTLS config contract |
| specs/003-mtls-transport-security/brainstorm.md | Design exploration and clarification answers |

Key Decisions

  • SPIRE only — no Istio dependency
  • mTLS enabled by default: permissive mode; operators opt out with mTLSMode: disabled
  • Fail clearly: MTLSReady=False condition when SPIRE is unavailable
  • Controller uses go-spiffe SDK directly (SpiffeFetcher as default)
  • spiffe-helper for data-plane certificates (file-based, sidecar-agnostic)
  • JWS signing deprecated — flags default to false with warnings

References

  • ADR: ODH-ADR-AgentOps-0002
  • Jira: RHAIENG-4944
  • Parent: RHAISTRAT-1599

Test plan

  • Team reviews spec, plan, and task breakdown
  • Validate assumptions about authbridge mTLS state on main
  • Confirm ConfigMap schema matches authbridge expectations
  • Agree on MVP scope (US1: agent-to-agent mTLS by default)

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
🤖 Generated with Claude Code

@varshaprasad96 varshaprasad96 requested a review from a team as a code owner June 3, 2026 23:01
@varshaprasad96

Contributor Author

cc: @r3v5 @akram Can we review this before moving to implementation?

@r3v5 r3v5 left a comment

Contributor

lgtm thank you!


1. **SPIRE (default and only)** — SPIRE-issued X.509 SVIDs via the Workload API. Already deployed for JWT SVIDs in Kagenti. The spiffe-helper sidecar or go-spiffe SDK provides certificates.

Istio, user-supplied certificates, and cert-manager are explicitly out of scope. No Istio dependency — Istio support can be added in a future iteration if needed.

@cwiklik

cwiklik commented Jun 8, 2026

Collaborator

The DCO check is failing on this PR — one of the two commits is missing its Signed-off-by trailer:

  • 3ba96e2 (spec: add mTLS transport security spec (003)): not signed off
  • 3f4cae4 (spec: update tasks.md with codebase audit findings): signed off

To fix, sign off all commits on the branch and force-push:

git rebase --signoff main      # use origin/main or upstream/main if that is your base
git push --force-with-lease

The Signed-off-by: name/email must match the commit author. Once the missing sign-off is added, the DCO check will re-run and pass.

@rhuss

rhuss commented Jun 8, 2026

Contributor

Review Guide: mTLS Transport Security for Agent Communication

Generated: 2026-06-08 | Spec: specs/003-mtls-transport-security/spec.md

Why This Change

Agent-to-agent and controller-to-agent communication in kagenti currently runs over plaintext HTTP by default. While authbridge already has full mTLS support implemented (permissive/strict modes, SPIRE-based SVIDs, per-handshake cert rotation), the operator doesn't activate it by default. Operators must manually set flags and configure mTLS mode. This spec makes mTLS the default transport security, with clear error conditions when SPIRE is unavailable.

What Changes

  1. mTLS enabled by default: mTLSMode defaults to permissive (was implicitly disabled). Agents communicate over mTLS automatically when SPIRE is deployed.
  2. MTLSReady condition: New status condition on AgentRuntime showing whether mTLS infrastructure (SPIRE) is available, with actionable error messages when it's not.
  3. Controller uses mTLS by default: --enable-verified-fetch and --enable-card-discovery flags flip to true. SpiffeFetcher becomes the default card fetcher.
  4. JWS signing deprecation: Legacy signing flags (--require-a2a-signature, --signature-audit-mode, --enforce-network-policies) emit deprecation warnings.
  5. ConfigMap mtls block: Operator injects mtls: mode: <value> into authbridge ConfigMap based on AgentRuntime spec.

No breaking changes. Existing deployments without SPIRE get a clear MTLSReady=False condition and can opt out with mTLSMode: disabled.
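
The default and the condition described above can be sketched as follows. This is a minimal illustration, not the operator's code: the `Condition` struct is a simplified stand-in for the Kubernetes condition type, the function names are hypothetical, and the `MTLSDisabled` and `MTLSConfigured` reasons are assumptions (only `SPIREUnavailable` appears in the spec discussion). How a disabled mode should be reported is also an assumption.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

const (
	ModeDisabled   = "disabled"
	ModePermissive = "permissive"
	ModeStrict     = "strict"
)

// effectiveMode treats an unset mTLSMode as permissive, the proposed default.
func effectiveMode(specMode string) string {
	if specMode == "" {
		return ModePermissive
	}
	return specMode
}

// mtlsReadyCondition derives MTLSReady from the requested mode and from
// whether SPIRE was detected for the workload.
func mtlsReadyCondition(specMode string, spireAvailable bool) Condition {
	c := Condition{Type: "MTLSReady"}
	mode := effectiveMode(specMode)
	switch {
	case mode == ModeDisabled:
		// Assumption: an explicit opt-out is reported as False with its own reason.
		c.Status, c.Reason = "False", "MTLSDisabled"
		c.Message = "mTLS is disabled via mTLSMode: disabled"
	case !spireAvailable:
		c.Status, c.Reason = "False", "SPIREUnavailable"
		c.Message = "SPIRE not detected; deploy SPIRE or set mTLSMode: disabled"
	default:
		// Assumption: the reason name for the healthy state is illustrative.
		c.Status, c.Reason = "True", "MTLSConfigured"
		c.Message = "mTLS is active in " + mode + " mode"
	}
	return c
}

func main() {
	for _, spire := range []bool{true, false} {
		c := mtlsReadyCondition("", spire)
		fmt.Printf("spire=%v -> %s=%s (%s)\n", spire, c.Type, c.Status, c.Reason)
	}
}
```

Keeping the derivation in one pure function makes the "fail clearly" behavior easy to unit test without a cluster.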

How It Works

The implementation relies heavily on what is already built:

  • Authbridge (kagenti-extensions): mTLS is fully implemented across all proxy modes. No changes needed, only verification.
  • Operator: The main work is (a) changing the mTLSMode kubebuilder default to permissive, (b) injecting the mtls: block into the authbridge ConfigMap, (c) adding MTLSReady condition logic, and (d) flipping flag defaults.
  • SPIRE detection: The controller checks whether spiffe-helper volume mounts exist in the workload's pod template. If they are absent while mTLS is enabled, it sets MTLSReady=False with reason SPIREUnavailable.
  • Config hash: resolvedConfig.MTLSMode is already included in the config hash, so mTLSMode changes trigger rolling restarts automatically.
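
The SPIRE detection heuristic in the list above could look roughly like this. The types are simplified local stand-ins for the Kubernetes pod template types, and the volume name `spiffe-helper-certs` is a hypothetical placeholder, since the actual mount name is not given in this thread.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes pod template types.
type VolumeMount struct{ Name, MountPath string }

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

type PodTemplate struct{ Containers []Container }

// hasSpiffeHelperMount reports whether any container in the template mounts
// a volume with the given name. The controller would treat a missing mount
// as "SPIRE unavailable" for that workload.
func hasSpiffeHelperMount(tpl PodTemplate, volumeName string) bool {
	for _, c := range tpl.Containers {
		for _, m := range c.VolumeMounts {
			if m.Name == volumeName {
				return true
			}
		}
	}
	return false
}

func main() {
	// "spiffe-helper-certs" is an assumed volume name used only for illustration.
	const vol = "spiffe-helper-certs"

	withHelper := PodTemplate{Containers: []Container{
		{Name: "agent", VolumeMounts: []VolumeMount{{Name: vol, MountPath: "/certs"}}},
	}}
	withoutHelper := PodTemplate{Containers: []Container{{Name: "agent"}}}

	fmt.Println(hasSpiffeHelperMount(withHelper, vol))    // mount present
	fmt.Println(hasSpiffeHelperMount(withoutHelper, vol)) // mount absent
}
```

A check of this shape only sees what is declared in the pod template, so a workload that obtains SVIDs some other way would be reported as lacking SPIRE.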

Of 24 tasks, 2 are already done ([DONE]), 1 is partial ([PARTIAL]), and the remaining 21 are new work, mostly focused on the operator side.

When It Applies

Applies when:

  • Deploying agents in clusters with SPIRE
  • Controller fetching agent cards from live workloads
  • Setting or changing mTLSMode on AgentRuntime CRs
  • Migrating from JWS signing to mTLS-based identity

Does not apply when:

  • Using Istio service mesh for mTLS (explicitly out of scope, separate effort in #399)
  • Working with user-supplied certificates or cert-manager (future iteration)
  • Authbridge plugin changes (orthogonal)
  • Cross-cluster agent federation (future work)

Key Decisions

  1. SPIRE only, no Istio dependency: The spec explicitly scopes to SPIRE-based mTLS. Istio service mesh mTLS (L4, pod-to-pod) is a separate effort tracked in #399 and PR #383. These are complementary (SPIRE = application-layer identity, Istio = infrastructure-layer encryption), not competing.

  2. Permissive as default, not strict: Permissive mode accepts both TLS and plaintext inbound connections. This allows gradual rollout without breaking existing agents that haven't enabled SPIRE yet.

  3. Controller uses go-spiffe SDK directly: SpiffeFetcher uses go-spiffe/v2 X509Source in-process, not file-based certificates. This is different from the data-plane approach (spiffe-helper sidecar writing PEM files).

  4. Feature-gated via existing CRD field: No new CLI flag for mTLS enablement. The mTLSMode field on AgentRuntimeSpec is the control surface. --enable-verified-fetch is retained as a kill switch only.

  5. JWS signing soft-deprecated: Flags default to false (already on main), warnings added. No code removal yet, just signaling the migration path.
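
To make the permissive-versus-strict distinction in decision 2 concrete, here is a sketch of one common way a listener can tell TLS from plaintext: peeking at the first byte, which is 0x16 for a TLS handshake record. This is not a description of how authbridge implements permissive mode; the function names are hypothetical, and the assumption that strict mode rejects plaintext is inferred from the usual meaning of the term.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// tlsHandshakeRecord is the TLS record content type for handshake messages;
// a ClientHello always starts with this byte.
const tlsHandshakeRecord = 0x16

// classify peeks at the first byte without consuming it, so the stream can
// still be handed to a TLS or plaintext handler afterwards.
func classify(r *bufio.Reader) (string, error) {
	first, err := r.Peek(1)
	if err != nil {
		return "", err
	}
	if first[0] == tlsHandshakeRecord {
		return "tls", nil
	}
	return "plaintext", nil
}

// classifyBytes is a convenience wrapper for in-memory data.
func classifyBytes(data []byte) (string, error) {
	return classify(bufio.NewReader(bytes.NewReader(data)))
}

// admit applies the mode policy: permissive takes both kinds of traffic,
// strict (by assumption) takes TLS only, anything else admits nothing.
func admit(mode, kind string) bool {
	switch mode {
	case "permissive":
		return kind == "tls" || kind == "plaintext"
	case "strict":
		return kind == "tls"
	default:
		return false
	}
}

func main() {
	samples := [][]byte{
		{0x16, 0x03, 0x01}, // start of a TLS ClientHello record
		[]byte("GET / HTTP/1.1\r\n"),
	}
	for _, s := range samples {
		kind, err := classifyBytes(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-9s permissive=%v strict=%v\n",
			kind, admit("permissive", kind), admit("strict", kind))
	}
}
```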

Areas Needing Attention

  • Overlap with Istio mTLS work: PR #383 (SharedTrust controller, already merged) and Issue #399 (Istio auto-labeling) introduce Istio-based mTLS at the infrastructure layer. This spec operates at the application layer (SPIRE). Reviewers should verify these don't conflict at the configuration level (e.g., what happens when both SPIRE mTLS and Istio mTLS are active on the same workload).

  • ConfigMap schema contract: T005 (injecting mtls: block) and T020 (verifying authbridge expects this shape) are the critical integration point. If the authbridge config schema doesn't match what the operator generates, mTLS silently fails.

  • MTLSReady condition gating Ready: T016 proposes that MTLSReady=False should affect the overall Ready condition. The exact behavior (block Ready entirely vs. add a warning) needs careful design, since it changes the controller's availability semantics.

  • SPIRE detection heuristic: Using spiffe-helper volume mount presence as a proxy for "SPIRE is available" may not cover all deployment patterns (e.g., SPIRE with CSI driver instead of init container).

Open Questions

  • How do SPIRE-based mTLS (this spec) and Istio-based mTLS (PR #383, Issue #399) coexist? Is double encryption acceptable, or should one disable when the other is active?
  • Should MTLSReady=False block the overall Ready=True condition, or just add a warning?
  • Does the SPIRE detection heuristic (spiffe-helper volume check) cover SPIRE CSI driver deployments?

Review Checklist

  • Key decisions are justified (especially SPIRE-only vs Istio interaction)
  • Scope matches the stated boundaries (no Istio, no user certs, no cross-cluster)
  • Constitution compliance verified (all 5 principles addressed in plan.md)
  • ConfigMap mtls: block schema matches authbridge expectations
  • No conflict with existing SharedTrust controller (PR #383)
  • Task reconciliation against main is accurate ([DONE]/[PARTIAL] markers)
  • Success criteria are achievable and testable
  • Deprecation warnings are clear and actionable

varshaprasad96 and others added 4 commits June 8, 2026 11:23
Add spec-driven development artifacts for mTLS transport security
covering controller-to-agent and agent-to-agent communication paths.
SPIRE is the sole certificate provider; Istio is out of scope.

Artifacts: spec.md, plan.md, research.md, data-model.md, tasks.md,
brainstorm.md, and authbridge config contract.

Jira: RHAIENG-4944

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>

Reconciled tasks against current main. Key findings:
- fetchCard() mTLS-first logic already implemented
- Envoy TLS contexts already wired in webhook injector
- Signing flags already default to false
- Marked T014 and T019 as DONE
- Added summary table of actual work needed (6 impl, 8 test, 6 polish)
- Updated file paths and line references to match current code

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Co-authored-by: Roland Huss <rhuss@redhat.com>
Assisted-By: 🤖 Claude Code

- Add Istio coexistence section: SPIRE and Istio mTLS are
  complementary layers, coexist without conflict
- Clarify MTLSReady does NOT block Ready condition; emit
  Warning Event instead for kubectl describe visibility
- Document SPIRE CSI driver as known limitation in detection
  heuristic (spiffe-helper is the supported pattern for now)
- Update T016 task to reflect these decisions

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
@varshaprasad96

Contributor Author

Tracking the following changes, based on #405 and team review:

  • FR-009: hash-based restart → annotation-based restart (kagenti.io/mtls-mode)
  • Remove ConfigMap mtls: block injection (T005, contracts/) — webhook handles this
  • Add: controller sets kagenti.io/mtls-mode annotation on pod template
  • Add: webhook reads mTLSMode from AgentRuntime CR, sets env var on authbridge container
  • Clarify controller/webhook boundary for mTLS config delivery
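
The controller side of the annotation-based restart listed above might be sketched like this. The `PodTemplate` struct is a simplified stand-in for the Kubernetes pod template metadata and `setMTLSModeAnnotation` is a hypothetical helper name; only the annotation key `kagenti.io/mtls-mode` comes from the tracking list. The sketch relies on the standard Kubernetes behavior that any change to a workload's pod template triggers a rollout.

```go
package main

import "fmt"

// mtlsModeAnnotation is the annotation key named in the tracking list.
const mtlsModeAnnotation = "kagenti.io/mtls-mode"

// PodTemplate is a simplified, hypothetical stand-in for pod template metadata.
type PodTemplate struct {
	Annotations map[string]string
}

// setMTLSModeAnnotation writes the mode onto the pod template and reports
// whether the template changed. A change to the template is what causes the
// workload controller to roll the pods.
func setMTLSModeAnnotation(tpl *PodTemplate, mode string) bool {
	if tpl.Annotations == nil {
		tpl.Annotations = map[string]string{}
	}
	if current, ok := tpl.Annotations[mtlsModeAnnotation]; ok && current == mode {
		return false // already up to date, no restart needed
	}
	tpl.Annotations[mtlsModeAnnotation] = mode
	return true
}

func main() {
	tpl := &PodTemplate{}
	fmt.Println(setMTLSModeAnnotation(tpl, "permissive")) // first write changes the template
	fmt.Println(setMTLSModeAnnotation(tpl, "permissive")) // same value, no change
	fmt.Println(setMTLSModeAnnotation(tpl, "strict"))     // mode change, template changes again
	fmt.Println(tpl.Annotations[mtlsModeAnnotation])
}
```

Returning a changed flag lets the reconciler skip the update call when nothing differs, which avoids needless restarts.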

@pdettori pdettori left a comment

Member

Solid spec-first approach. Architecture is sound — leveraging existing authbridge mTLS and adding operator-side wiring. The team has already identified the key pivot via #405 (annotation-based delivery via webhook instead of ConfigMap injection), which simplifies the operator's responsibility.

Main suggestion: update the spec artifacts to reflect the #405 decisions before merge, so the spec doesn't go stale on day one.

Areas reviewed: Architecture/design, scope boundaries, security model, task coherence
Commits: 4 commits, all signed-off ✓
CI: E2E failing (unrelated — spec-only PR), all other checks pass

Comment thread specs/003-mtls-transport-security/tasks.md

Update mTLS config delivery from ConfigMap injection to annotation +
env var approach per PR kagenti#405 team review:

- Controller sets kagenti.io/mtls-mode annotation on pod template
  (triggers rolling restart on change, independent of config hash)
- Webhook reads mTLSMode from AgentRuntime CR at pod CREATE time,
  sets MTLS_MODE env var on authbridge container
- Acknowledge Istio mTLS coexistence (PR kagenti#383, kagenti#399, RHAIENG-5467)
- Mark T005 as SUPERSEDED with new annotation-based approach
- Update contracts, data-model, research, REVIEWERS.md

Jira: RHAIENG-4944

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
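
The webhook half of the commit above (setting the MTLS_MODE env var on the authbridge container at pod CREATE time) could be sketched as follows. The types are simplified stand-ins for the Kubernetes container types, `injectMTLSMode` is a hypothetical helper, and the container name `authbridge` is an assumption about how the sidecar is named; only the env var name MTLS_MODE comes from the commit message.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes container types.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

// mtlsModeEnv is the env var name given in the commit message.
const mtlsModeEnv = "MTLS_MODE"

// injectMTLSMode sets MTLS_MODE on the named container, overwriting an
// existing value or appending a new entry. It returns false when the
// target container is not present in the pod.
func injectMTLSMode(containers []Container, target, mode string) bool {
	for i := range containers {
		if containers[i].Name != target {
			continue
		}
		for j := range containers[i].Env {
			if containers[i].Env[j].Name == mtlsModeEnv {
				containers[i].Env[j].Value = mode
				return true
			}
		}
		containers[i].Env = append(containers[i].Env, EnvVar{Name: mtlsModeEnv, Value: mode})
		return true
	}
	return false
}

func main() {
	// "authbridge" as the container name is an assumption for illustration.
	pod := []Container{{Name: "agent"}, {Name: "authbridge"}}

	fmt.Println(injectMTLSMode(pod, "authbridge", "permissive"))
	fmt.Println(pod[1].Env)
	fmt.Println(injectMTLSMode(pod, "missing", "permissive"))
}
```

Because the value is read when the pod is created, the annotation change on the pod template is what ensures running pods are replaced and pick up a new mode.
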
@varshaprasad96 varshaprasad96 changed the title from "spec: mTLS transport security for agent communication (003)" to "Feat: [spec] mTLS transport security for agent communication (003)" Jun 10, 2026
@varshaprasad96

Contributor Author

I have updated the spec based on the decisions made in #405. Are we good to get the spec merged?
cc: @r3v5 @pdettori @rh-dnagornuks

@pdettori pdettori merged commit 4b7768a into kagenti:main Jun 10, 2026
15 of 17 checks passed

5 participants