RFE-9213: enforce machine-config-controller minimum replicas #6144
hmongia12 wants to merge 2 commits into
Related: https://redhat.atlassian.net/browse/RFE-9213 Co-authored-by: Cursor <cursoragent@cursor.com>
…stant
- Add HACKING.md steps to verify scale-to-0 and scale-to-2 converge to 1.
- Add DefaultMachineConfigControllerReplicas and use it in operator sync paths.
- Clarify EnsureDeployment and manifest replica comments.
Co-authored-by: Cursor <cursoragent@cursor.com>
@hmongia12: This pull request references RFE-9213, which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set.
Walkthrough
This PR implements minimum replica enforcement for the machine-config-controller Deployment across multiple layers: declaring
Changes
Machine-config-controller replica enforcement
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 15 passed
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.12.2): Command failed
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: hmongia12
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/sync.go`:
- Around line 1216-1235: The Deployment GET and Patch calls
(optr.kubeClient.AppsV1().Deployments(ns).Get and .Patch) are using
context.TODO(), so create a bounded context with a timeout (e.g., ctx, cancel :=
context.WithTimeout(parentCtx, timeout)) and use that ctx for the Get and Patch
calls and for the retry closure; ensure you call cancel() (defer cancel()) to
avoid leaks and propagate cancellation into retry.OnError so each API attempt is
time‑bounded. Replace the context.TODO() occurrences in this reconcile path
(including inside the retry lambda) with the new ctx and choose an appropriate
timeout constant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: aa0f019c-0956-4023-ac5e-e353d29c681c
📒 Files selected for processing (7)
- docs/HACKING.md
- lib/resourcemerge/apps.go
- lib/resourcemerge/apps_test.go
- manifests/machineconfigcontroller/deployment.yaml
- pkg/controller/common/constants.go
- pkg/operator/operator.go
- pkg/operator/sync.go
    d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
    if err != nil {
        if apierrors.IsNotFound(err) {
            return false, nil
        }
        return false, fmt.Errorf("could not get deployment %s: %w", name, err)
    }
    cur := int32(1)
    if d.Spec.Replicas != nil {
        cur = *d.Spec.Replicas
    }
    if cur == desired {
        return false, nil
    }
    klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
    patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
    if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
        _, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
            context.TODO(), name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
        )
Add bounded contexts to the new Deployment GET/PATCH calls.
This reconcile path uses context.TODO() for API calls plus retry; without per-call timeouts, slow/hung API calls can block this loop longer than intended.
🔧 Suggested change
     func (optr *Operator) ensureMachineConfigControllerReplicaCount(desired int32) (bool, error) {
         name := ctrlcommon.ControllerConfigName
         ns := ctrlcommon.MCONamespace
    -    d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
    +    getCtx, getCancel := context.WithTimeout(context.Background(), 10*time.Second)
    +    defer getCancel()
    +    d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(getCtx, name, metav1.GetOptions{})
         if err != nil {
             if apierrors.IsNotFound(err) {
                 return false, nil
             }
             return false, fmt.Errorf("could not get deployment %s: %w", name, err)
         }
    @@
         klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
         patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
         if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
    +        patchCtx, patchCancel := context.WithTimeout(context.Background(), 10*time.Second)
    +        defer patchCancel()
             _, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
    -            context.TODO(), name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
    +            patchCtx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
             )
             return err
         }); retryErr != nil {
             return false, fmt.Errorf("could not patch deployment %s replicas to %d: %w", name, desired, retryErr)
         }

As per coding guidelines, **/*.go: context.Context for cancellation and timeouts.
📝 Committable suggestion
    getCtx, getCancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer getCancel()
    d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(getCtx, name, metav1.GetOptions{})
    if err != nil {
        if apierrors.IsNotFound(err) {
            return false, nil
        }
        return false, fmt.Errorf("could not get deployment %s: %w", name, err)
    }
    cur := int32(1)
    if d.Spec.Replicas != nil {
        cur = *d.Spec.Replicas
    }
    if cur == desired {
        return false, nil
    }
    klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
    patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
    if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
        patchCtx, patchCancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer patchCancel()
        _, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
            patchCtx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
        )
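The replica check in the quoted function can be isolated from the API calls. The following is a small sketch of that decision logic only; `replicaPatch` is a hypothetical name introduced here, and the sketch assumes, as the quoted code does, that an unset `spec.replicas` is treated as 1 (the Kubernetes default for Deployments).

```go
package main

import "fmt"

// replicaPatch is a hypothetical helper mirroring the logic quoted in the
// review. current is the live spec.replicas; nil means unset and is treated
// as 1. It returns a strategic-merge patch body and true when the live value
// differs from desired, or nil and false when no patch is needed.
func replicaPatch(current *int32, desired int32) ([]byte, bool) {
	cur := int32(1)
	if current != nil {
		cur = *current
	}
	if cur == desired {
		return nil, false
	}
	return []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired)), true
}

func main() {
	zero, two := int32(0), int32(2)
	// Cover the unset case plus the two manual scalings the PR tests (0 and 2).
	for _, c := range []*int32{nil, &zero, &two} {
		patch, needed := replicaPatch(c, 1)
		fmt.Printf("needed=%t patch=%s\n", needed, patch)
	}
}
```

Keeping the comparison separate from the client calls makes the scale-to-0 and scale-to-2 cases easy to cover with a table-driven unit test, without a fake clientset.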
Manual verification (test cluster)
Tested with custom linux/amd64 operator image (

Commands (summary)
    oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0
    # build/push/set image on machine-config-operator only (amd64)
    oc -n openshift-machine-config-operator set image \
      deployment/machine-config-operator \
      machine-config-operator="quay.io/rhn-support-hmongia/hm:mco-dev-1778485569"
    oc -n openshift-machine-config-operator scale deployment/machine-config-controller --replicas=0
    # wait ~30s, then check deploy + logs

Operator image confirmed

Real operator logs (scale MCC to 0 → restore to 1)
Restore on the

Deployment state after restore

Unit tests
    go test ./lib/resourcemerge/... -count=1 -run TestEnsureDeploymentReplicasMerge

Notes
Summary
Implements RFE-9213 by ensuring the `machine-config-controller` Deployment always converges to 1 replica after manual scaling (`oc scale` to 0 or 2+).
- `replicas: 1` to `manifests/machineconfigcontroller/deployment.yaml` and `DefaultMachineConfigControllerReplicas` in `pkg/controller/common/constants.go`.
- `spec.replicas` from rendered manifests in `lib/resourcemerge.EnsureDeployment` (with unit tests).
- `syncAll`, post-`ApplyDeployment` in `syncMachineConfigController`, and a 30s background loop started before cache sync.
- `docs/HACKING.md` for custom operator image testing.

Motivation
`machine-config-controller` is designed to run as a singleton. If it is scaled to 0, node/MC reconciliation can stall. This change restores the desired replica count through manifest apply, strategic merge patch, and periodic reconcile.

Test plan
- `go test ./lib/resourcemerge/...`
- `machine-config-operator` image per `docs/HACKING.md`
- `oc scale deployment/machine-config-controller --replicas=0` → observe `spec.replicas` return to `1`
- `oc scale deployment/machine-config-controller --replicas=2` → observe `spec.replicas` return to `1`
- `MachineConfigControllerReplicas` / periodic reconcile messages

Summary by CodeRabbit
Documentation
Bug Fixes