Feature: 2-layer config merge for AgentRuntime controller#405
Conversation
fde4022 to
fb4edbe
Compare
cwiklik
left a comment
There was a problem hiding this comment.
Exemplary refactor implementing the RHAIENG-4931 "minimal enrollment" direction: the controller's config hash now reflects only platform config (cluster + namespace defaults + feature gates + authbridge-runtime CM), CR spec fields are excluded, and the redundant ComputeDefaultsOnlyHash is collapsed into ComputeConfigHash. All callers (Reconcile, handleDeletion, defaults_config_reconciler) are updated; the empty-hash-on-error semantics are preserved in handleDeletion. Tests and 6 docs are thoroughly and consistently updated 3-layer→2-layer, and the unused agentv1alpha1 import is removed.
Cross-PR note (comment #1): this interacts with #408 — the two PRs move mtlsMode in opposite directions (#408 made it hash-affecting via the default; #405 removes it from the hash). Both are intentional, but the rollout-propagation path for CR-only overrides should be nailed down in RHAIENG-4936.
Test-quality nit (not inline): the renamed unit tests "should NOT change when spec type/identity/MTLSMode/AuthBridgeMode changes" are now tautological — they call ComputeConfigHash twice with identical args (the spec param is gone), so they can't vary the field. The real guarantee is well-covered by the controller-level "same hash as a minimal AgentRuntime" test and E2E; these could be collapsed.
Upgrade impact: correctly disclosed — every workload rolls once on deploy (old 3-layer hash ≠ new 2-layer hash).
4 commits, all signed-off, cleanly split feat/test/test/docs. All 16 CI checks green.
| // It captures the 2-layer merge of cluster defaults → namespace defaults. | ||
| // CR-level fields (type, identity, skills, etc.) are NOT included — the | ||
| // webhook reads those at pod CREATE time (RHAIENG-4936). | ||
| type resolvedConfig struct { |
There was a problem hiding this comment.
suggestion: Removing MTLSMode/AuthBridgeMode from the hash means a CR-level edit to these no longer triggers a rollout. But per the field doc just added in #408, authbridge can't hot-reload mTLS — so a CR-only mtlsMode change (e.g. tightening disabled→strict) won't reach running pods until something else rolls them (a platform-config change or manual restart). That's the intended enrollment-only direction, but it interacts directly with #408. Worth confirming RHAIENG-4936 closes the loop on how a CR-only override propagates to running pods — otherwise there's a window where a user edits mtlsMode and observes no effect on existing pods.
|
A suggestion on this - when we were discussing the ADR we wanted mTLS to be mandatory as a replacement to Agent Card signing, which definitely makes sense and +100 on that. That is the eventual goal, but for the sake of migration, we need to support a way where users can opt in. This is so that for a few releases, we enable mTLS by default, have opt out approach for the sake of migration and eventually remove the knob for making any changes. This is what was proposed in #401 and is open to discussion. Considering that - the current 2 layer approach makes sense for the platform config in this PR. But any mTLS changes on AR still need to trigger a pod restart since authbridge reads its mTLS config at startup (no hot reload for TLS listener mode). With mTLS removed from hash, changing it from disabled to strict has no effect on running pods, until they restart for some other reason. In order to go over this, can we have the controller set a separate annotation on pod template (like kagenti.io/mtls-mode) that tracks the CR's mTLS mode. When it changes, the pod template changes, triggering a rolling restart independent of the config hash. This keeps the platform level hash clean, while ensuring mTLS mode changes take effect. Eventually we make it mandatory, and remove this annotation. |
The AgentRuntime controller config hash now reflects only platform-level configuration (cluster + namespace ConfigMaps). CR spec fields (type, identity, authBridgeMode, mtlsMode, skills) are excluded — the webhook reads those at pod CREATE time. Delete ComputeDefaultsOnlyHash (now identical to ComputeConfigHash). Remove spec param from ComputeConfigHash and resolveConfig. One-time rollout on upgrade: existing workloads will get a rolling update because old 3-layer hash != new 2-layer hash. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Convert CR-override hash tests to negative assertions (hash should NOT change). Delete ComputeDefaultsOnlyHash tests. Flip deletion hash assertion (hash stays same). Convert skill hash tests to assert no change. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Flip 5 hash assertions: deletion, identity overrides, and skill changes no longer affect the config hash. CR fields are excluded from the 2-layer merge. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Hey @varshaprasad96 ! Sounds reasonable to me |
Update all references from 3-layer to 2-layer config merge across architecture, controller-webhook interaction, authbridge webhook, API reference, getting started guide, and e2e README. RHAIENG-4935 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
fb4edbe to
c4df156
Compare
|
I have just rebased PR to solve merge conflicts. Wait for more opinions regarding @varshaprasad96's comment |
Sounds good to me. |
Update mTLS config delivery from ConfigMap injection to annotation + env var approach per PR kagenti#405 team review: - Controller sets kagenti.io/mtls-mode annotation on pod template (triggers rolling restart on change, independent of config hash) - Webhook reads mTLSMode from AgentRuntime CR at pod CREATE time, sets MTLS_MODE env var on authbridge container - Acknowledge Istio mTLS coexistence (PR kagenti#383, kagenti#399, RHAIENG-5467) - Mark T005 as SUPERSEDED with new annotation-based approach - Update contracts, data-model, research, REVIEWERS.md Jira: RHAIENG-4944 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
pdettori
left a comment
There was a problem hiding this comment.
Confirmed the E2E CI failure (Skill Discovery > should populate linkedSkills from annotation) is not introduced by this PR — the same test fails on main (run 27247825117, merge of PR #401). It's a pre-existing flaky test unrelated to config hash changes.
The 2-layer refactoring is clean: ComputeDefaultsOnlyHash fully removed, all callers updated, assertions correctly flipped. No issues found beyond what the prior review already noted.
Summary
ComputeDefaultsOnlyHash()— now identical toComputeConfigHash()Jira
[RHAIENG-4935]
Motivation
Part of [RHAIENG-4931] epic to simplify AgentRuntime to a minimal enrollment mechanism. The controller should only hash platform config — CR spec changes should not trigger rolling updates via the hash.
Changes
feat:agentruntime_config.go,agentruntime_controller.go,defaults_config_reconciler.gotest:agentruntime_config_test.go,agentruntime_controller_test.gotest:e2e_test.godocs:architecture.md,controller-webhook-interaction.md,authbridge-webhook.md,api-reference.md,GETTING_STARTED.md,e2e/README.mdOne-time rollout on upgrade
Every existing workload will get a rolling update when this deploys, because old 3-layer hash ≠ new 2-layer hash. This is expected and acceptable.
Test plan
go test ./internal/controller/...— all passgo test ./internal/webhook/...— all passmake lint— no new lint issuesRHAIENG-4935 Manual Verification — OpenShift
Date: 2026-06-05
ROSA version: OpenShift 4.19.18
Tester: Ian Miller
Precondition check
Both namespaces have identical base hash:
e18db36b7add1d1c6eba0119a7cba9c3ffc9b62c0d3603343a17d18dd3976bb3Test 2: CR spec.type change should NOT change hash
PASS: Hash unchanged (
e18db36b...=e18db36b...)Test 3: CR spec.mtlsMode change should NOT change hash
PASS: Hash unchanged (
e18db36b...=e18db36b...)Test 4: CR spec.authBridgeMode change should NOT change hash
PASS: Hash unchanged (
e18db36b...=e18db36b...)Test 5: Namespace ConfigMap should change hash in that namespace ONLY (AC4)
PASS: NS1 hash changed (
e18db36b...->6253a8fa...)PASS: NS2 hash unchanged (
e18db36b...=e18db36b...)Test 6: Cluster ConfigMap edit should change hash in ALL namespaces (AC3)
PASS: NS1 hash changed (
6253a8fa...->1f565f3e...)PASS: NS2 hash changed (
e18db36b...->8acd90ac...)Test 7: CR deletion should NOT change hash
PASS: Hash unchanged after CR deletion (
1f565f3e...=1f565f3e...)Test 8: Config hash is deterministic (AC5)
PASS: 3 consecutive reads identical (
8acd90ac...)Full Acceptance Criteria Summary
Results: 10 passed, 0 failed
Signed-off-by: Ian Miller milleryan2003@gmail.com
🤖 Co-Authored with Claude Code