Skip to content

OCPBUGS-86949: Guard HCCO KubeletConfig CM deletion against transient source absence#8672

Open
vsolanki12 wants to merge 1 commit into
openshift:mainfrom
vsolanki12:fix-OCPBUGS-86949
Open

OCPBUGS-86949: Guard HCCO KubeletConfig CM deletion against transient source absence#8672
vsolanki12 wants to merge 1 commit into
openshift:mainfrom
vsolanki12:fix-OCPBUGS-86949

Conversation

@vsolanki12
Copy link
Copy Markdown
Contributor

@vsolanki12 vsolanki12 commented Jun 4, 2026

What this PR does / why we need it:

HCCO's reconcileKubeletConfig deletes guest-side ConfigMaps whose source is absent from the HCP namespace. During the immutable-to-mutable migration (OCPBUGS-85778) or any transient API error, the source CM can be briefly absent, causing HCCO to delete the guest copy. NTO then regenerates MachineConfigs without it, triggering an MCO node rollout.

This PR:

  • Guards deletion of mirrored CMs: Skips deletion of guest-side CMs carrying NTOMirroredConfigLabel, since their source is expected to reappear on the next reconcile cycle
  • Adds ownership guard to deleteImmutableConfigMapIfNeeded: Refactors to use DeleteIfNeededWithPredicate with a KubeletConfigConfigMapLabel ownership check, preventing accidental deletion of unrelated immutable ConfigMaps
  • Clears stale fields after predicate-based delete: Resets ResourceVersion and Immutable after DeleteIfNeededWithPredicate to avoid stale-resourceVersion errors and immutable leakage on the subsequent CreateOrUpdate

Which issue(s) this PR fixes:

Fixes OCPBUGS-86949

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • KubeletConfig ConfigMaps are now properly converted to mutable state during reconciliation.
    • Mirrored ConfigMaps are preserved during cleanup when their source is transiently unavailable.
    • Immutable KubeletConfig ConfigMaps are properly deleted and recreated as mutable.
  • Tests

    • Extended test coverage for KubeletConfig ConfigMap reconciliation scenarios.

…bsence

HCCO's reconcileKubeletConfig deletes guest-side ConfigMaps whose
source is absent from the HCP namespace. During the immutable-to-mutable
migration (OCPBUGS-85778) or any transient API error, the source CM can
be briefly absent, causing HCCO to delete the guest copy. NTO then
regenerates MachineConfigs without it, triggering an MCO node rollout.

Skip deletion of guest-side CMs that carry the NTOMirroredConfigLabel,
since their source is expected to reappear on the next reconcile.

Also refactor deleteImmutableConfigMapIfNeeded to use
DeleteIfNeededWithPredicate with a KubeletConfigConfigMapLabel ownership
guard, and clear ResourceVersion after the predicate-based delete to
avoid stale-resourceVersion errors on the subsequent CreateOrUpdate.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 4, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-86949, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

HCCO's reconcileKubeletConfig deletes guest-side ConfigMaps whose source is absent from the HCP namespace. During the immutable-to-mutable migration (OCPBUGS-85778) or any transient API error, the source CM can be briefly absent, causing HCCO to delete the guest copy. NTO then regenerates MachineConfigs without it, triggering an MCO node rollout.

This PR:

  • Guards deletion of mirrored CMs: Skips deletion of guest-side CMs carrying NTOMirroredConfigLabel, since their source is expected to reappear on the next reconcile cycle
  • Adds ownership guard to deleteImmutableConfigMapIfNeeded: Refactors to use DeleteIfNeededWithPredicate with a KubeletConfigConfigMapLabel ownership check, preventing accidental deletion of unrelated immutable ConfigMaps
  • Clears stale fields after predicate-based delete: Resets ResourceVersion and Immutable after DeleteIfNeededWithPredicate to avoid stale-resourceVersion errors and immutable leakage on the subsequent CreateOrUpdate

Which issue(s) this PR fixes:

Fixes OCPBUGS-86949

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Jun 4, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: cfc90e28-a09a-44ee-89d2-9ac82ca4729a

📥 Commits

Reviewing files that changed from the base of the PR and between b07ae90 and 816742a.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go

📝 Walkthrough

Walkthrough

This PR modifies KubeletConfig ConfigMap reconciliation in the hosted cluster operator to address mutability and NTO mirroring concerns. The reconciler now forces ConfigMaps to be mutable before update by clearing resourceVersion and removing the immutable flag. During cleanup, mirrored ConfigMaps—marked with the NTO mirror label—are preserved to avoid deletion when their source ConfigMap is transiently absent. The deleteImmutableConfigMapIfNeeded function was refactored to use a predicate-based delete helper that explicitly targets immutable KubeletConfig-labeled ConfigMaps and logs when they are deleted to allow recreation as mutable. Test coverage was extended with scenarios for mirrored preservation, non-mirrored deletion, and immutable ConfigMap handling, and assertions were updated to verify ConfigMaps are mutable after reconciliation.

🚥 Pre-merge checks | ✅ 4 | ❌ 7

❌ Failed checks (7 inconclusive)

Check name Status Explanation Resolution
Stable And Deterministic Test Names ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: guarding KubeletConfig ConfigMap deletion against transient source absence, which directly matches the PR's core objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot
Copy link
Copy Markdown

@vsolanki12: This pull request references Jira Issue OCPBUGS-86949, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

What this PR does / why we need it:

HCCO's reconcileKubeletConfig deletes guest-side ConfigMaps whose source is absent from the HCP namespace. During the immutable-to-mutable migration (OCPBUGS-85778) or any transient API error, the source CM can be briefly absent, causing HCCO to delete the guest copy. NTO then regenerates MachineConfigs without it, triggering an MCO node rollout.

This PR:

  • Guards deletion of mirrored CMs: Skips deletion of guest-side CMs carrying NTOMirroredConfigLabel, since their source is expected to reappear on the next reconcile cycle
  • Adds ownership guard to deleteImmutableConfigMapIfNeeded: Refactors to use DeleteIfNeededWithPredicate with a KubeletConfigConfigMapLabel ownership check, preventing accidental deletion of unrelated immutable ConfigMaps
  • Clears stale fields after predicate-based delete: Resets ResourceVersion and Immutable after DeleteIfNeededWithPredicate to avoid stale-resourceVersion errors and immutable leakage on the subsequent CreateOrUpdate

Which issue(s) this PR fixes:

Fixes OCPBUGS-86949

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

  • KubeletConfig ConfigMaps are now properly converted to mutable state during reconciliation.

  • Mirrored ConfigMaps are preserved during cleanup when their source is transiently unavailable.

  • Immutable KubeletConfig ConfigMaps are properly deleted and recreated as mutable.

  • Tests

  • Extended test coverage for KubeletConfig ConfigMap reconciliation scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Jun 4, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.44%. Comparing base (25817d4) to head (816742a).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...rconfigoperator/controllers/resources/resources.go 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8672   +/-   ##
=======================================
  Coverage   41.43%   41.44%           
=======================================
  Files         756      756           
  Lines       93658    93664    +6     
=======================================
+ Hits        38807    38816    +9     
+ Misses      52128    52125    -3     
  Partials     2723     2723           
Files with missing lines Coverage Δ
...rconfigoperator/controllers/resources/resources.go 56.90% <86.66%> (+0.19%) ⬆️
Flag Coverage Δ
cmd-support 34.87% <ø> (ø)
cpo-hostedcontrolplane 43.50% <ø> (ø)
cpo-other 42.79% <86.66%> (+0.05%) ⬆️
hypershift-operator 51.56% <ø> (ø)
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vsolanki12 vsolanki12 marked this pull request as ready for review June 4, 2026 15:57
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and sdminonne June 4, 2026 15:58
@cblecker
Copy link
Copy Markdown
Member

cblecker commented Jun 4, 2026

/uncc

@openshift-ci openshift-ci Bot removed the request for review from cblecker June 4, 2026 16:08
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@vsolanki12
Copy link
Copy Markdown
Contributor Author

vsolanki12 commented Jun 4, 2026

I have tried to reproduce and tested the fix in my test cluster as below:

Environment: KubeVirt HCP cluster vsolankihcp, version 5.0.0-ec.1, 2-node NodePool

Custom CPO image deployed:

  $ oc get pod hosted-cluster-config-operator-7df85759cc-k25pl -n clusters-vsolankihcp \
      -o jsonpath='{.spec.containers[*].image}'
  quay.io/vsolanki/control-plane-operator:OCPBUGS-86949

Test scenario: Deleted source KubeletConfig CM from HCP namespace to simulate transient absence during immutable-to-mutable migration (OCPBUGS-85778).

Before fix: guest-side CM immediately removed:

  $ oc get configmap kubelet-config-maxpods-vsolankihcp -n openshift-config-managed \
      --kubeconfig=/tmp/vsolankihcp-guest.kubeconfig
  Error from server (NotFound): configmaps "kubelet-config-maxpods-vsolankihcp" not found

After fix: guest side CM preserved:

  $ oc get configmap kubelet-config-maxpods-vsolankihcp -n openshift-config-managed \
      --kubeconfig=/tmp/vsolankihcp-guest.kubeconfig
  NAME                                  DATA   AGE
  kubelet-config-maxpods-vsolankihcp    1      2m29s

HCCO logs guard active:

  {"level":"info","ts":"2026-06-04T15:17:18Z",
   "msg":"skipping deletion of mirrored ConfigMap with transiently absent source",
   "controller":"resources",
   "configMap":"openshift-config-managed/kubelet-config-maxpods-vsolankihcp"}

@vsolanki12
Copy link
Copy Markdown
Contributor Author

/cc @jparrill

@openshift-ci openshift-ci Bot requested a review from jparrill June 4, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants