Skip to content

RFE-9213: enforce machine-config-controller minimum replicas#6144

Open
hmongia12 wants to merge 2 commits into
openshift:mainfrom
hmongia12:cursor/mcc-minimum-replicas-docs-and-constants
Open

RFE-9213: enforce machine-config-controller minimum replicas#6144
hmongia12 wants to merge 2 commits into
openshift:mainfrom
hmongia12:cursor/mcc-minimum-replicas-docs-and-constants

Conversation

@hmongia12
Copy link
Copy Markdown

@hmongia12 hmongia12 commented Jun 4, 2026

Summary

Implements RFE-9213 by ensuring the machine-config-controller Deployment always converges to 1 replica after manual scaling (oc scale to 0 or 2+).

  • Add explicit replicas: 1 to manifests/machineconfigcontroller/deployment.yaml and DefaultMachineConfigControllerReplicas in pkg/controller/common/constants.go.
  • Merge spec.replicas from rendered manifests in lib/resourcemerge.EnsureDeployment (with unit tests).
  • Reconcile replicas in the operator via: early pre-sync in syncAll, post-ApplyDeployment in syncMachineConfigController, and a 30s background loop started before cache sync.
  • Document manual verification steps in docs/HACKING.md for custom operator image testing.

Motivation

machine-config-controller is designed to run as a singleton. If it is scaled to 0, node/MC reconciliation can stall. This change restores the desired replica count through manifest apply, strategic merge patch, and periodic reconcile.

Test plan

  • go test ./lib/resourcemerge/...
  • Deploy custom machine-config-operator image per docs/HACKING.md
  • oc scale deployment/machine-config-controller --replicas=0 → observe spec.replicas return to 1
  • oc scale deployment/machine-config-controller --replicas=2 → observe spec.replicas return to 1
  • Check operator logs for MachineConfigControllerReplicas / periodic reconcile messages

Summary by CodeRabbit

  • Documentation

    • Added verification guide for machine-config-controller replica enforcement behavior, including step-by-step commands and troubleshooting instructions.
  • Bug Fixes

    • Machine-config-controller is now automatically maintained at a minimum of 1 replica, preventing unexpected scale-down to 0 and ensuring consistent service availability.

hmongia12 and others added 2 commits June 4, 2026 15:05
…stant

- Add HACKING.md steps to verify scale-to-0 and scale-to-2 converge to 1.
- Add DefaultMachineConfigControllerReplicas and use it in operator sync paths.
- Clarify EnsureDeployment and manifest replica comments.

Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 4, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Jun 4, 2026

@hmongia12: This pull request references RFE-9213 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature request to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Implements RFE-9213 by ensuring the machine-config-controller Deployment always converges to 1 replica after manual scaling (oc scale to 0 or 2+).

  • Add explicit replicas: 1 to manifests/machineconfigcontroller/deployment.yaml and DefaultMachineConfigControllerReplicas in pkg/controller/common/constants.go.
  • Merge spec.replicas from rendered manifests in lib/resourcemerge.EnsureDeployment (with unit tests).
  • Reconcile replicas in the operator via: early pre-sync in syncAll, post-ApplyDeployment in syncMachineConfigController, and a 30s background loop started before cache sync.
  • Document manual verification steps in docs/HACKING.md for custom operator image testing.

Motivation

machine-config-controller is designed to run as a singleton. If it is scaled to 0, node/MC reconciliation can stall. This change restores the desired replica count through manifest apply, strategic merge patch, and periodic reconcile.

Test plan

  • go test ./lib/resourcemerge/...
  • Deploy custom machine-config-operator image per docs/HACKING.md
  • oc scale deployment/machine-config-controller --replicas=0 → observe spec.replicas return to 1
  • oc scale deployment/machine-config-controller --replicas=2 → observe spec.replicas return to 1
  • Check operator logs for MachineConfigControllerReplicas / periodic reconcile messages

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Walkthrough

This PR implements minimum replica enforcement for the machine-config-controller Deployment across multiple layers: declaring spec.replicas: 1 in the manifest, updating resource merge logic to synchronize the replica count, adding early and post-apply replica reconciliation in sync operations, and running a background loop in the operator to periodically restore replicas if manual scaling occurs.

Changes

Machine-config-controller replica enforcement

Layer / File(s) Summary
Constants and manifest declaration
pkg/controller/common/constants.go, manifests/machineconfigcontroller/deployment.yaml
New DefaultMachineConfigControllerReplicas constant (int32 = 1) is defined and declared in the Deployment manifest with a comment explaining that both the manifest and operator enforce this replica count.
Resource merge support for replicas
lib/resourcemerge/apps.go, lib/resourcemerge/apps_test.go
EnsureDeployment is updated to merge spec.replicas from the rendered manifest into the existing Deployment when non-nil and different. Includes clarifying comments on drift prevention and new tests (TestEnsureDeploymentReplicasMerge) covering update, omit, and match scenarios with a deploymentFixture helper.
Sync-level early and post-apply enforcement
pkg/operator/sync.go
syncAll now calls syncMachineConfigControllerReplicasEarly at the start to restore replicas before any sync functions run. New helper ensureMachineConfigControllerReplicaCount reads the Deployment, compares current replicas to desired, and applies a strategic merge patch with retry logic when different. syncMachineConfigController also enforces replicas after apply and waits for rollout if either the apply or the scale patch was applied.
Operator background enforcement and initial sync
pkg/operator/operator.go
operator.Run now records stopCh earlier and starts a background wait.Until loop that periodically calls ensureMachineConfigControllerReplicaCount on a 30-second interval. Worker startup is adjusted to compute the workqueue key, start worker goroutines, and enqueue the key once immediately for an initial sync in addition to informer-driven events. Comment on OSImageStream ordering in syncFuncs is clarified.
Verification documentation
docs/HACKING.md
New troubleshooting section "Verifying machine-config-controller minimum replicas (RFE-9213)" provides step-by-step shell commands to validate operator image usage, confirm scale-down/recovery and scale-up/recovery behavior of the controller, and optionally inspect operator logs for replica-enforcement messages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main objective: enforcing a minimum replica count for the machine-config-controller deployment, which is the primary purpose of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains no Ginkgo tests; only standard Go tests with stable, descriptive, static test names without dynamic content like generated identifiers or timestamps.
Test Structure And Quality ✅ Passed PR adds standard Go unit tests (lib/resourcemerge/apps_test.go) using *testing.T, not Ginkgo tests. Custom check for "Ginkgo test code quality" is not applicable—no Ginkgo tests are added or modified.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. The only test added is a standard Go unit test in apps_test.go using testing.T, not Ginkgo patterns.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. All new tests use standard Go testing (testing.T), and other changes are production code or configuration files. Check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed No new topology-incompatible scheduling constraints introduced. PR enforces hardcoded replica count of 1 (topology-neutral) and adds operator enforcement logic only.
Ote Binary Stdout Contract ✅ Passed PR modifies only operator/library code (replica sync logic) with no stdout writes in process-level code, main(), init(), or OTE test suite setup.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. Only standard Go unit tests (TestEnsureDeploymentReplicasMerge) were added to lib/resourcemerge/apps_test.go. The check is not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in the modified files.
Container-Privileges ✅ Passed No privileged settings, hostPID/hostNetwork/hostIPC, SYS_ADMIN, or allowPrivilegeEscalation added. MCC deployment uses restricted-v2 SCC. Only docs, Go code, and benign replica config changes.
No-Sensitive-Data-In-Logs ✅ Passed All logging statements log only operational data: namespace names, resource names, replica counts, and standard Kubernetes API errors. No sensitive data like passwords, tokens, keys, or PII exposed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hmongia12
Once this PR has been reviewed and has the lgtm label, please assign rishabhsaini for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 4, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

Hi @hmongia12. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/sync.go`:
- Around line 1216-1235: The Deployment GET and Patch calls
(optr.kubeClient.AppsV1().Deployments(ns).Get and .Patch) are using
context.TODO(), so create a bounded context with a timeout (e.g., ctx, cancel :=
context.WithTimeout(parentCtx, timeout)) and use that ctx for the Get and Patch
calls and for the retry closure; ensure you call cancel() (defer cancel()) to
avoid leaks and propagate cancellation into retry.OnError so each API attempt is
time‑bounded. Replace the context.TODO() occurrences in this reconcile path
(including inside the retry lambda) with the new ctx and choose an appropriate
timeout constant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa0f019c-0956-4023-ac5e-e353d29c681c

📥 Commits

Reviewing files that changed from the base of the PR and between 29cbc3e and af22dc0.

📒 Files selected for processing (7)
  • docs/HACKING.md
  • lib/resourcemerge/apps.go
  • lib/resourcemerge/apps_test.go
  • manifests/machineconfigcontroller/deployment.yaml
  • pkg/controller/common/constants.go
  • pkg/operator/operator.go
  • pkg/operator/sync.go

Comment thread pkg/operator/sync.go
Comment on lines +1216 to +1235
d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
return false, nil
}
return false, fmt.Errorf("could not get deployment %s: %w", name, err)
}
cur := int32(1)
if d.Spec.Replicas != nil {
cur = *d.Spec.Replicas
}
if cur == desired {
return false, nil
}
klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
_, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
context.TODO(), name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add bounded contexts to the new Deployment GET/PATCH calls.

This reconcile path uses context.TODO() for API calls plus retry; without per-call timeouts, slow/hung API calls can block this loop longer than intended.

🔧 Suggested change
 func (optr *Operator) ensureMachineConfigControllerReplicaCount(desired int32) (bool, error) {
 	name := ctrlcommon.ControllerConfigName
 	ns := ctrlcommon.MCONamespace
-	d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
+	getCtx, getCancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer getCancel()
+	d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(getCtx, name, metav1.GetOptions{})
 	if err != nil {
 		if apierrors.IsNotFound(err) {
 			return false, nil
 		}
 		return false, fmt.Errorf("could not get deployment %s: %w", name, err)
 	}
@@
 	klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
 	patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
 	if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
+		patchCtx, patchCancel := context.WithTimeout(context.Background(), 10*time.Second)
+		defer patchCancel()
 		_, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
-			context.TODO(), name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
+			patchCtx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
 		)
 		return err
 	}); retryErr != nil {
 		return false, fmt.Errorf("could not patch deployment %s replicas to %d: %w", name, desired, retryErr)
 	}

As per coding guidelines, **/*.go: context.Context for cancellation and timeouts.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
return false, nil
}
return false, fmt.Errorf("could not get deployment %s: %w", name, err)
}
cur := int32(1)
if d.Spec.Replicas != nil {
cur = *d.Spec.Replicas
}
if cur == desired {
return false, nil
}
klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
_, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
context.TODO(), name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
)
getCtx, getCancel := context.WithTimeout(context.Background(), 10*time.Second)
defer getCancel()
d, err := optr.kubeClient.AppsV1().Deployments(ns).Get(getCtx, name, metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
return false, nil
}
return false, fmt.Errorf("could not get deployment %s: %w", name, err)
}
cur := int32(1)
if d.Spec.Replicas != nil {
cur = *d.Spec.Replicas
}
if cur == desired {
return false, nil
}
klog.Infof("MachineConfigControllerReplicas: restoring %s/%s spec.replicas from %d to %d", ns, name, cur, desired)
patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, desired))
if retryErr := retry.OnError(retry.DefaultRetry, mcoResourceApply.IsApplyErrorRetriable, func() error {
patchCtx, patchCancel := context.WithTimeout(context.Background(), 10*time.Second)
defer patchCancel()
_, err := optr.kubeClient.AppsV1().Deployments(ns).Patch(
patchCtx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/sync.go` around lines 1216 - 1235, The Deployment GET and Patch
calls (optr.kubeClient.AppsV1().Deployments(ns).Get and .Patch) are using
context.TODO(), so create a bounded context with a timeout (e.g., ctx, cancel :=
context.WithTimeout(parentCtx, timeout)) and use that ctx for the Get and Patch
calls and for the retry closure; ensure you call cancel() (defer cancel()) to
avoid leaks and propagate cancellation into retry.OnError so each API attempt is
time‑bounded. Replace the context.TODO() occurrences in this reconcile path
(including inside the retry lambda) with the new ctx and choose an appropriate
timeout constant.

@hmongia12
Copy link
Copy Markdown
Author

Manual verification (test cluster)

Tested with custom linux/amd64 operator image (Dockerfile.localquay.io/rhn-support-hmongia/hm:<tag>). Only deployment/machine-config-operator was updated; machine-config-controller remained on the release image. CVO was scaled to 0 during testing so the payload did not revert the operator image.

Commands (summary)

oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0

# build/push/set image on machine-config-operator only (amd64)
oc -n openshift-machine-config-operator set image \
  deployment/machine-config-operator \
  machine-config-operator="quay.io/rhn-support-hmongia/hm:mco-dev-1778485569"

oc -n openshift-machine-config-operator scale deployment/machine-config-controller --replicas=0
# wait ~30s, then check deploy + logs

Operator image confirmed

$ oc get deploy machine-config-operator -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-config-operator")].image}'
quay.io/rhn-support-hmongia/hm:mco-dev-1778485569

$ oc get pod -l k8s-app=machine-config-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="machine-config-operator")].imageID}{"\n"}{end}'
machine-config-operator-557849cd45-9gl47	quay.io/rhn-support-hmongia/hm@sha256:fe7aead90b02618fa7f8d9f2bf7e12ddec7be69b94c3f341dbc7d5dfacc61ec7

Real operator logs (scale MCC to 0 → restore to 1)

I0511 07:29:10.768267       1 sync.go:1184] MachineConfigControllerReplicas: running early replica check
I0511 07:29:10.776117       1 sync.go:1217] Restoring openshift-machine-config-operator/machine-config-controller spec.replicas from 0 to 1
I0511 07:29:10.786264       1 sync.go:1191] MachineConfigControllerReplicas: patched machine-config-controller to desired replica count
I0511 08:14:09.097516       1 sync.go:1190] MachineConfigControllerReplicas: running early replica check
I0511 08:14:09.100665       1 sync.go:1220] MachineConfigControllerReplicas: openshift-machine-config-operator/machine-config-controller spec.replicas ptr=0xc003c61ec0 effective=0 want=1
I0511 08:14:09.100682       1 sync.go:1224] Restoring openshift-machine-config-operator/machine-config-controller spec.replicas from 0 to 1
I0511 08:14:09.109798       1 sync.go:1197] MachineConfigControllerReplicas: patched machine-config-controller to desired replica count
I0511 08:24:09.252897       1 sync.go:1190] MachineConfigControllerReplicas: running early replica check
I0511 08:24:09.259425       1 sync.go:1220] MachineConfigControllerReplicas: openshift-machine-config-operator/machine-config-controller spec.replicas ptr=0xc003dce188 effective=0 want=1
I0511 08:24:09.259447       1 sync.go:1224] Restoring openshift-machine-config-operator/machine-config-controller spec.replicas from 0 to 1
I0511 08:24:09.270781       1 sync.go:1197] MachineConfigControllerReplicas: patched machine-config-controller to desired replica count

Restore on the 08:24:09 burst: early check → patched ≈ 18 ms (API patch only; pod Ready takes longer).

Deployment state after restore

$ oc get deploy machine-config-controller -o jsonpath='{.spec.replicas}{"  gen="}{.metadata.generation}{"  rv="}{.metadata.resourceVersion}{"\n"}'
1  gen=21  rv=12824998
Events:
  Normal  ScalingReplicaSet  deployment-controller  Scaled down replica set machine-config-controller-876d48dbd from 1 to 0
  Normal  ScalingReplicaSet  deployment-controller  Scaled up replica set machine-config-controller-876d48dbd from 0 to 1

Unit tests

go test ./lib/resourcemerge/... -count=1 -run TestEnsureDeploymentReplicasMerge

Notes

  • E0511 ... safety controllerconfig sync ... version (3320406c-dirty) differs from controllerconfig (ab5bf377) is expected with a custom operator vs release ControllerConfig; it does not block the replica patch path (see lines above).
  • Scale 0 → 1 was exercised on cluster. Scale 2 → 1 is covered by code paths + docs/HACKING.md; not pasted in logs here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants