Skip to content

Feature: 2-layer config merge for AgentRuntime controller#405

Merged
pdettori merged 4 commits into
kagenti:mainfrom
r3v5:two-layer-config-merge-for-agent-runtime-controller
Jun 10, 2026
Merged

Feature: 2-layer config merge for AgentRuntime controller#405
pdettori merged 4 commits into
kagenti:mainfrom
r3v5:two-layer-config-merge-for-agent-runtime-controller

Conversation

@r3v5

@r3v5 r3v5 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Remove CR spec fields (type, identity, authBridgeMode, mtlsMode) from the controller's config hash computation
  • Config hash now reflects only platform-level configuration: cluster defaults + namespace defaults
  • CR-level overrides are still read by the webhook at pod CREATE time (separate story RHAIENG-4936)
  • Delete ComputeDefaultsOnlyHash() — now identical to ComputeConfigHash()

Note: spec.skills was already removed from the AgentRuntime CRD by PR #388. This PR removes the remaining CR fields from the hash and cleans up related tests/docs.

Jira

[RHAIENG-4935]

Motivation

Part of [RHAIENG-4931] epic to simplify AgentRuntime to a minimal enrollment mechanism. The controller should only hash platform config — CR spec changes should not trigger rolling updates via the hash.

Changes

Commit Scope Files
feat: Core implementation agentruntime_config.go, agentruntime_controller.go, defaults_config_reconciler.go
test: Unit tests agentruntime_config_test.go, agentruntime_controller_test.go
test: E2E tests e2e_test.go
docs: Documentation architecture.md, controller-webhook-interaction.md, authbridge-webhook.md, api-reference.md, GETTING_STARTED.md, e2e/README.md

One-time rollout on upgrade

Every existing workload will get a rolling update when this deploys, because old 3-layer hash ≠ new 2-layer hash. This is expected and acceptable.

Test plan

  • go test ./internal/controller/... — all pass
  • go test ./internal/webhook/... — all pass
  • make lint — no new lint issues
  • Manual testing on ROSA OpenShift 4.19.18 — 10/10 assertions pass:
    • CR spec changes (type, mtlsMode, authBridgeMode) do NOT change hash
    • Namespace ConfigMap changes DO change hash (only in that namespace)
    • Cluster ConfigMap changes DO change hash (all namespaces)
    • CR deletion does NOT change hash
    • Hash is deterministic (3 consecutive reads identical)

RHAIENG-4935 Manual Verification — OpenShift

Date: 2026-06-05
ROSA version: OpenShift 4.19.18
Tester: Ian Miller

Precondition check

$ oc get agentruntime -n ocp-agent-test
NAME      TYPE    TARGET       PHASE    AGE
test-rt   agent   test-agent   Active   3m38s

$ oc get agentruntime -n ocp-agent-test-2
NAME        TYPE    TARGET         PHASE    AGE
test-rt-2   agent   test-agent-2   Active   3m38s

$ oc get pods -n ocp-agent-test
NAME                          READY   STATUS    RESTARTS   AGE
test-agent-6b85897b9b-dh69v   1/1     Running   0          6m

$ oc get pods -n ocp-agent-test-2
NAME                            READY   STATUS    RESTARTS   AGE
test-agent-2-86b7f8bc4f-zc47b   1/1     Running   0          6m4s

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

Both namespaces have identical base hash: e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

Test 2: CR spec.type change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"type":"tool"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 3: CR spec.mtlsMode change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"mtlsMode":"strict"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 4: CR spec.authBridgeMode change should NOT change hash

$ oc patch agentruntime test-rt -n ocp-agent-test --type=merge -p '{"spec":{"authBridgeMode":"lite"}}'
agentruntime.agent.kagenti.dev/test-rt patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: Hash unchanged (e18db36b... = e18db36b...)

Test 5: Namespace ConfigMap should change hash in that namespace ONLY (AC4)

$ oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: ns-test-defaults
  namespace: ocp-agent-test
  labels:
    kagenti.io/defaults: "true"
data:
  otel-endpoint: "ns-collector:4317"
EOF
configmap/ns-test-defaults created

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
6253a8fa9087b0318c564e483dfcaefd9cb525674c40bb8815b3ea62cc0c8c95

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3

PASS: NS1 hash changed (e18db36b... -> 6253a8fa...)
PASS: NS2 hash unchanged (e18db36b... = e18db36b...)

Test 6: Cluster ConfigMap edit should change hash in ALL namespaces (AC3)

$ oc patch configmap kagenti-platform-config -n kagenti-system --type=merge -p '{"data":{"test-key":"ocp-manual-test"}}'
configmap/kagenti-platform-config patched

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
1f565f3e9b3ca02f8e50db7a01ff19ccc45124dea66b029b370fca910a6b7c6c

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

PASS: NS1 hash changed (6253a8fa... -> 1f565f3e...)
PASS: NS2 hash changed (e18db36b... -> 8acd90ac...)

Test 7: CR deletion should NOT change hash

$ oc delete agentruntime test-rt -n ocp-agent-test
agentruntime.agent.kagenti.dev "test-rt" deleted from ocp-agent-test namespace

$ sleep 20

$ oc get deployment test-agent -n ocp-agent-test -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
1f565f3e9b3ca02f8e50db7a01ff19ccc45124dea66b029b370fca910a6b7c6c

PASS: Hash unchanged after CR deletion (1f565f3e... = 1f565f3e...)

Test 8: Config hash is deterministic (AC5)

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

$ oc get deployment test-agent-2 -n ocp-agent-test-2 -o jsonpath='{.spec.template.metadata.annotations.kagenti\.io/config-hash}' && echo
8acd90ac64edf9bde20c4d9a32838bc795278e153dbf236c52b31687630bb83f

PASS: 3 consecutive reads identical (8acd90ac...)


Full Acceptance Criteria Summary

AC Description Tests Result
1 Controller resolves config from exactly two layers: global and namespace Tests 1, 5, 6 PASS
2 No per-AgentRuntime config overrides are read or applied Tests 2, 3, 4 PASS
3 Changing global config triggers rolling update of ALL affected workloads Test 6 PASS
4 Changing namespace config triggers rolling update of workloads in that namespace only Test 5 PASS
5 Config hash is deterministic — same inputs always produce same hash Test 8 PASS

Results: 10 passed, 0 failed

Signed-off-by: Ian Miller milleryan2003@gmail.com
🤖 Co-Authored with Claude Code

@r3v5 r3v5 requested a review from a team as a code owner June 5, 2026 12:57
@r3v5 r3v5 force-pushed the two-layer-config-merge-for-agent-runtime-controller branch 2 times, most recently from fde4022 to fb4edbe Compare June 5, 2026 14:42

@cwiklik cwiklik left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exemplary refactor implementing the RHAIENG-4931 "minimal enrollment" direction: the controller's config hash now reflects only platform config (cluster + namespace defaults + feature gates + authbridge-runtime CM), CR spec fields are excluded, and the redundant ComputeDefaultsOnlyHash is collapsed into ComputeConfigHash. All callers (Reconcile, handleDeletion, defaults_config_reconciler) are updated; the empty-hash-on-error semantics are preserved in handleDeletion. Tests and 6 docs are thoroughly and consistently updated 3-layer→2-layer, and the unused agentv1alpha1 import is removed.

Cross-PR note (comment #1): this interacts with #408 — the two PRs move mtlsMode in opposite directions (#408 made it hash-affecting via the default; #405 removes it from the hash). Both are intentional, but the rollout-propagation path for CR-only overrides should be nailed down in RHAIENG-4936.

Test-quality nit (not inline): the renamed unit tests "should NOT change when spec type/identity/MTLSMode/AuthBridgeMode changes" are now tautological — they call ComputeConfigHash twice with identical args (the spec param is gone), so they can't vary the field. The real guarantee is well-covered by the controller-level "same hash as a minimal AgentRuntime" test and E2E; these could be collapsed.

Upgrade impact: correctly disclosed — every workload rolls once on deploy (old 3-layer hash ≠ new 2-layer hash).

4 commits, all signed-off, cleanly split feat/test/test/docs. All 16 CI checks green.

// It captures the 2-layer merge of cluster defaults → namespace defaults.
// CR-level fields (type, identity, skills, etc.) are NOT included — the
// webhook reads those at pod CREATE time (RHAIENG-4936).
type resolvedConfig struct {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Removing MTLSMode/AuthBridgeMode from the hash means a CR-level edit to these no longer triggers a rollout. But per the field doc just added in #408, authbridge can't hot-reload mTLS — so a CR-only mtlsMode change (e.g. tightening disabled→strict) won't reach running pods until something else rolls them (a platform-config change or manual restart). That's the intended enrollment-only direction, but it interacts directly with #408. Worth confirming RHAIENG-4936 closes the loop on how a CR-only override propagates to running pods — otherwise there's a window where a user edits mtlsMode and observes no effect on existing pods.

Comment thread kagenti-operator/docs/architecture.md Outdated
@varshaprasad96

Copy link
Copy Markdown
Contributor

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

r3v5 added 3 commits June 9, 2026 11:24
The AgentRuntime controller config hash now reflects only platform-level
configuration (cluster + namespace ConfigMaps). CR spec fields (type,
identity, authBridgeMode, mtlsMode, skills) are excluded — the webhook
reads those at pod CREATE time.

Delete ComputeDefaultsOnlyHash (now identical to ComputeConfigHash).
Remove spec param from ComputeConfigHash and resolveConfig.

One-time rollout on upgrade: existing workloads will get a rolling update
because old 3-layer hash != new 2-layer hash.

RHAIENG-4935

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Convert CR-override hash tests to negative assertions (hash should NOT
change). Delete ComputeDefaultsOnlyHash tests. Flip deletion hash
assertion (hash stays same). Convert skill hash tests to assert no
change.

RHAIENG-4935

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Flip 5 hash assertions: deletion, identity overrides, and skill changes
no longer affect the config hash. CR fields are excluded from the
2-layer merge.

RHAIENG-4935

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
@r3v5

r3v5 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

Hey @varshaprasad96 ! Sounds reasonable to me

Update all references from 3-layer to 2-layer config merge across
architecture, controller-webhook interaction, authbridge webhook,
API reference, getting started guide, and e2e README.

RHAIENG-4935

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
@r3v5 r3v5 force-pushed the two-layer-config-merge-for-agent-runtime-controller branch from fb4edbe to c4df156 Compare June 9, 2026 10:27
@r3v5 r3v5 changed the title feat: 2-layer config merge for AgentRuntime controller Feature: 2-layer config merge for AgentRuntime controller Jun 9, 2026
@r3v5

r3v5 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

I have just rebased PR to solve merge conflicts. Wait for more opinions regarding @varshaprasad96's comment

@rh-dnagornuks

Copy link
Copy Markdown
Contributor

A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion.

Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode).

With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason.

In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation.

cc: @rhuss @r3v5 @rh-dnagornuks

Sounds good to me.

varshaprasad96 added a commit to varshaprasad96/kagenti-operator that referenced this pull request Jun 10, 2026
Update mTLS config delivery from ConfigMap injection to annotation +
env var approach per PR kagenti#405 team review:

- Controller sets kagenti.io/mtls-mode annotation on pod template
  (triggers rolling restart on change, independent of config hash)
- Webhook reads mTLSMode from AgentRuntime CR at pod CREATE time,
  sets MTLS_MODE env var on authbridge container
- Acknowledge Istio mTLS coexistence (PR kagenti#383, kagenti#399, RHAIENG-5467)
- Mark T005 as SUPERSEDED with new annotation-based approach
- Update contracts, data-model, research, REVIEWERS.md

Jira: RHAIENG-4944

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

@pdettori pdettori left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the E2E CI failure (Skill Discovery > should populate linkedSkills from annotation) is not introduced by this PR — the same test fails on main (run 27247825117, merge of PR #401). It's a pre-existing flaky test unrelated to config hash changes.

The 2-layer refactoring is clean: ComputeDefaultsOnlyHash fully removed, all callers updated, assertions correctly flipped. No issues found beyond what the prior review already noted.

@pdettori pdettori merged commit 5b5ca58 into kagenti:main Jun 10, 2026
15 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants