Feature: 2-layer config merge for AgentRuntime controller by r3v5 · Pull Request #405 · kagenti/kagenti-operator

r3v5 · 2026-06-05T12:57:01Z

Summary

Remove CR spec fields (type, identity, authBridgeMode, mtlsMode) from the controller's config hash computation
Config hash now reflects only platform-level configuration: cluster defaults + namespace defaults
CR-level overrides are still read by the webhook at pod CREATE time (separate story RHAIENG-4936)
Delete ComputeDefaultsOnlyHash() — now identical to ComputeConfigHash()

Note: spec.skills was already removed from the AgentRuntime CRD by PR #388. This PR removes the remaining CR fields from the hash and cleans up related tests/docs.

Jira

[RHAIENG-4935]

Motivation

Part of [RHAIENG-4931] epic to simplify AgentRuntime to a minimal enrollment mechanism. The controller should only hash platform config — CR spec changes should not trigger rolling updates via the hash.

Changes

Commit	Scope	Files
`feat:`	Core implementation	`agentruntime_config.go`, `agentruntime_controller.go`, `defaults_config_reconciler.go`
`test:`	Unit tests	`agentruntime_config_test.go`, `agentruntime_controller_test.go`
`test:`	E2E tests	`e2e_test.go`
`docs:`	Documentation	`architecture.md`, `controller-webhook-interaction.md`, `authbridge-webhook.md`, `api-reference.md`, `GETTING_STARTED.md`, `e2e/README.md`

One-time rollout on upgrade

Every existing workload will get a rolling update when this deploys, because old 3-layer hash ≠ new 2-layer hash. This is expected and acceptable.

Test plan

go test ./internal/controller/... — all pass
go test ./internal/webhook/... — all pass
make lint — no new lint issues
Manual testing on ROSA OpenShift 4.19.18 — 10/10 assertions pass:
- CR spec changes (type, mtlsMode, authBridgeMode) do NOT change hash
- Namespace ConfigMap changes DO change hash (only in that namespace)
- Cluster ConfigMap changes DO change hash (all namespaces)
- CR deletion does NOT change hash
- Hash is deterministic (3 consecutive reads identical)

RHAIENG-4935 Manual Verification — OpenShift

Date: 2026-06-05
ROSA version: OpenShift 4.19.18
Tester: Ian Miller

Precondition check

$ oc get agentruntime -n ocp-agent-test
NAME      TYPE    TARGET       PHASE    AGE
test-rt   agent   test-agent   Active   3m38s

$ oc get agentruntime -n ocp-agent-test-2
NAME        TYPE    TARGET         PHASE    AGE
test-rt-2   agent   test-agent-2   Active   3m38s

$ oc get pods -n ocp-agent-test
NAME                          READY   STATUS    RESTARTS   AGE
test-agent-6b85897b9b-dh69v   1/1     Running   0          6m

$ oc get pods -n ocp-agent-test-2
NAME                            READY   STATUS    RESTARTS   AGE
test-agent-2-86b7f8bc4f-zc47b   1/1     Running   0          6m4s

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

Both namespaces have identical base hash: e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

Test 2: CR spec.type change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"type":"tool"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 3: CR spec.mtlsMode change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"mtlsMode":"strict"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 4: CR spec.authBridgeMode change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"authBridgeMode":"lite"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 5: Namespace ConfigMap should change hash in that namespace ONLY (AC4)

$ oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: ns-test-defaults
  namespace: ocp-agent-test
  labels:
    kagenti.io/defaults: "true"
data:
  otel-endpoint: "ns-collector:4317"
EOF
configmap/ns-test-defaults created

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
6253a8fa9087b0318c564e483dfcaefd9cb525674c40bb8815b3ea62cc0c8c95

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: NS1 hash changed (e18db36b... -> 6253a8fa...)
PASS: NS2 hash unchanged (e18db36b... = e18db36b...)

Test 6: Cluster ConfigMap edit should change hash in ALL namespaces (AC3)

$ oc patch configmap kagenti-platform-config -n kagenti-system --type=merge -p '{"data":{"test-key":"ocp-manual-test"}}'
configmap/kagenti-platform-config patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
1f565f3e9b3ca02f8e50db7a01ff19ccc45124dea66b029b370fca910a6b7c6c

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

PASS: NS1 hash changed (6253a8fa... -> 1f565f3e...)
PASS: NS2 hash changed (e18db36b... -> 8acd90ac...)

Test 7: CR deletion should NOT change hash

$ oc delete agentruntime test-rt -n ocp-agent-test
agentruntime.agent.kagenti.dev "test-rt" deleted from ocp-agent-test namespace

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
1f565f3e9b3ca02f8e50db7a01ff19ccc45124dea66b029b370fca910a6b7c6c

PASS: Hash unchanged after CR deletion (1f565f3e... = 1f565f3e...)

Test 8: Config hash is deterministic (AC5)

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

PASS: 3 consecutive reads identical (8acd90ac...)

Full Acceptance Criteria Summary

AC	Description	Tests	Result
1	Controller resolves config from exactly two layers: global and namespace	Tests 1, 5, 6	PASS
2	No per-AgentRuntime config overrides are read or applied	Tests 2, 3, 4	PASS
3	Changing global config triggers rolling update of ALL affected workloads	Test 6	PASS
4	Changing namespace config triggers rolling update of workloads in that namespace only	Test 5	PASS
5	Config hash is deterministic — same inputs always produce same hash	Test 8	PASS

Results: 10 passed, 0 failed

Signed-off-by: Ian Miller milleryan2003@gmail.com
🤖 Co-Authored with Claude Code

cwiklik

Exemplary refactor implementing the RHAIENG-4931 "minimal enrollment" direction: the controller's config hash now reflects only platform config (cluster + namespace defaults + feature gates + authbridge-runtime CM), CR spec fields are excluded, and the redundant ComputeDefaultsOnlyHash is collapsed into ComputeConfigHash. All callers (Reconcile, handleDeletion, defaults_config_reconciler) are updated; the empty-hash-on-error semantics are preserved in handleDeletion. Tests and 6 docs are thoroughly and consistently updated 3-layer→2-layer, and the unused agentv1alpha1 import is removed.

Cross-PR note (comment #1): this interacts with #408 — the two PRs move mtlsMode in opposite directions (#408 made it hash-affecting via the default; #405 removes it from the hash). Both are intentional, but the rollout-propagation path for CR-only overrides should be nailed down in RHAIENG-4936.

Test-quality nit (not inline): the renamed unit tests "should NOT change when spec type/identity/MTLSMode/AuthBridgeMode changes" are now tautological — they call ComputeConfigHash twice with identical args (the spec param is gone), so they can't vary the field. The real guarantee is well-covered by the controller-level "same hash as a minimal AgentRuntime" test and E2E; these could be collapsed.

Upgrade impact: correctly disclosed — every workload rolls once on deploy (old 3-layer hash ≠ new 2-layer hash).

4 commits, all signed-off, cleanly split feat/test/test/docs. All 16 CI checks green.

cwiklik · 2026-06-08T13:40:02Z

+// It captures the 2-layer merge of cluster defaults → namespace defaults.
+// CR-level fields (type, identity, skills, etc.) are NOT included — the
+// webhook reads those at pod CREATE time (RHAIENG-4936).
 type resolvedConfig struct {


suggestion: Removing MTLSMode/AuthBridgeMode from the hash means a CR-level edit to these no longer triggers a rollout. But per the field doc just added in #408, authbridge can't hot-reload mTLS — so a CR-only mtlsMode change (e.g. tightening disabled→strict) won't reach running pods until something else rolls them (a platform-config change or manual restart). That's the intended enrollment-only direction, but it interacts directly with #408. Worth confirming RHAIENG-4936 closes the loop on how a CR-only override propagates to running pods — otherwise there's a window where a user edits mtlsMode and observes no effect on existing pods.

varshaprasad96 · 2026-06-08T20:26:53Z

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

The AgentRuntime controller config hash now reflects only platform-level configuration (cluster + namespace ConfigMaps). CR spec fields (type, identity, authBridgeMode, mtlsMode, skills) are excluded — the webhook reads those at pod CREATE time. Delete ComputeDefaultsOnlyHash (now identical to ComputeConfigHash). Remove spec param from ComputeConfigHash and resolveConfig. One-time rollout on upgrade: existing workloads will get a rolling update because old 3-layer hash != new 2-layer hash. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>

Convert CR-override hash tests to negative assertions (hash should NOT change). Delete ComputeDefaultsOnlyHash tests. Flip deletion hash assertion (hash stays same). Convert skill hash tests to assert no change. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>

Flip 5 hash assertions: deletion, identity overrides, and skill changes no longer affect the config hash. CR fields are excluded from the 2-layer merge. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>

r3v5 · 2026-06-09T10:25:15Z

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

Hey @varshaprasad96 ! Sounds reasonable to me

Update all references from 3-layer to 2-layer config merge across architecture, controller-webhook interaction, authbridge webhook, API reference, getting started guide, and e2e README. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>

r3v5 · 2026-06-09T11:05:48Z

I have just rebased PR to solve merge conflicts. Wait for more opinions regarding @varshaprasad96's comment

rh-dnagornuks · 2026-06-09T11:32:05Z

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

Sounds good to me.

Update mTLS config delivery from ConfigMap injection to annotation + env var approach per PR kagenti#405 team review: - Controller sets kagenti.io/mtls-mode annotation on pod template (triggers rolling restart on change, independent of config hash) - Webhook reads mTLSMode from AgentRuntime CR at pod CREATE time, sets MTLS_MODE env var on authbridge container - Acknowledge Istio mTLS coexistence (PR kagenti#383, kagenti#399, RHAIENG-5467) - Mark T005 as SUPERSEDED with new annotation-based approach - Update contracts, data-model, research, REVIEWERS.md Jira: RHAIENG-4944 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

pdettori

Confirmed the E2E CI failure (Skill Discovery > should populate linkedSkills from annotation) is not introduced by this PR — the same test fails on main (run 27247825117, merge of PR #401). It's a pre-existing flaky test unrelated to config hash changes.

The 2-layer refactoring is clean: ComputeDefaultsOnlyHash fully removed, all callers updated, assertions correctly flipped. No issues found beyond what the prior review already noted.

r3v5 requested a review from a team as a code owner June 5, 2026 12:57

r3v5 force-pushed the two-layer-config-merge-for-agent-runtime-controller branch 2 times, most recently from fde4022 to fb4edbe Compare June 5, 2026 14:42

clawgenti mentioned this pull request Jun 8, 2026

Weekly Report 2026-06-08 kagenti/kagenti#1840

Open

cwiklik approved these changes Jun 8, 2026

View reviewed changes

varshaprasad96 mentioned this pull request Jun 8, 2026

Feat: [spec] mTLS transport security for agent communication (003) #401

Merged

4 tasks

r3v5 added 3 commits June 9, 2026 11:24

r3v5 force-pushed the two-layer-config-merge-for-agent-runtime-controller branch from fb4edbe to c4df156 Compare June 9, 2026 10:27

r3v5 changed the title ~~feat: 2-layer config merge for AgentRuntime controller~~ Feature: 2-layer config merge for AgentRuntime controller Jun 9, 2026

r3v5 mentioned this pull request Jun 9, 2026

Feat: add observedGeneration; simplify conditions #416

Merged

5 tasks

pdettori approved these changes Jun 10, 2026

View reviewed changes

pdettori merged commit 5b5ca58 into kagenti:main Jun 10, 2026
15 of 17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: 2-layer config merge for AgentRuntime controller#405

Feature: 2-layer config merge for AgentRuntime controller#405
pdettori merged 4 commits into
kagenti:mainfrom
r3v5:two-layer-config-merge-for-agent-runtime-controller

r3v5 commented Jun 5, 2026 •

edited

Loading

Uh oh!

cwiklik left a comment

Uh oh!

cwiklik Jun 8, 2026

Uh oh!

Uh oh!

varshaprasad96 commented Jun 8, 2026

Uh oh!

r3v5 commented Jun 9, 2026

Uh oh!

r3v5 commented Jun 9, 2026

Uh oh!

rh-dnagornuks commented Jun 9, 2026

Uh oh!

pdettori left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

r3v5 commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Jira

Motivation

Changes

One-time rollout on upgrade

Test plan

RHAIENG-4935 Manual Verification — OpenShift

Precondition check

Test 2: CR spec.type change should NOT change hash

Test 3: CR spec.mtlsMode change should NOT change hash

Test 4: CR spec.authBridgeMode change should NOT change hash

Test 5: Namespace ConfigMap should change hash in that namespace ONLY (AC4)

Test 6: Cluster ConfigMap edit should change hash in ALL namespaces (AC3)

Test 7: CR deletion should NOT change hash

Test 8: Config hash is deterministic (AC5)

Full Acceptance Criteria Summary

Uh oh!

cwiklik left a comment

Choose a reason for hiding this comment

Uh oh!

cwiklik Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

varshaprasad96 commented Jun 8, 2026

Uh oh!

r3v5 commented Jun 9, 2026

Uh oh!

r3v5 commented Jun 9, 2026

Uh oh!

rh-dnagornuks commented Jun 9, 2026

Uh oh!

pdettori left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

r3v5 commented Jun 5, 2026 •

edited

Loading