Skip to content

NO-JIRA: Fix races in machineset sync units#551

Open
mdbooth wants to merge 3 commits into
openshift:mainfrom
openshift-cloud-team:machinesetsync-races
Open

NO-JIRA: Fix races in machineset sync units#551
mdbooth wants to merge 3 commits into
openshift:mainfrom
openshift-cloud-team:machinesetsync-races

Conversation

@mdbooth
Copy link
Copy Markdown
Contributor

@mdbooth mdbooth commented May 7, 2026

  • Fix test race deleting CAPA machine template
  • Fix incorrect uses of k8sClient.Create in Eventually
  • Fix machineset sync test races between MAPI and CAPI object creation

Summary by CodeRabbit

  • Tests
    • Improved synchronization and reliability of MachineSet reconciliation tests for ClusterAPI scenarios: added post-status-update hooks, explicit waits for informer cache/observed-generation propagation, and more robust deletion/finalizer simulation flows.
    • Centralized and enhanced test resource creation: unified helper now supports create options, consistent dry-run handling, and uniform retry/error patterns across test suites.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@mdbooth: This pull request explicitly references no jira issue.

Details

In response to this:

  • Fix test race deleting CAPA machine template
  • Fix incorrect uses of k8sClient.Create in Eventually
  • Fix machineset sync test races between MAPI and CAPI object creation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7fadb6c8-fcab-423e-822a-ec0ad6d37e6c

📥 Commits

Reviewing files that changed from the base of the PR and between eb108fa and 16ba30f.

📒 Files selected for processing (1)
  • pkg/controllers/machinesetsync/machineset_sync_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/machinesetsync/machineset_sync_controller_test.go

Walkthrough

The PR enhances MachineSet-sync tests to wait for the controller-runtime manager informer cache (Status.ObservedGeneration and template labels), adds an afterMAPIUpdate deferred hook to control test ordering after MAPI status updates, and changes the kCreate test helper to accept variadic client.CreateOption for option-aware resource creation.

Changes

MachineSet sync tests

Layer / File(s) Summary
Create helper signature
pkg/controllers/machinesetsync/machineset_sync_controller_test.go
kCreate(ctx context.Context, obj client.Object)kCreate(ctx context.Context, obj client.Object, opts ...client.CreateOption); forwards opts to k8sClient.Create enabling option-aware creates in tests.
Manager informer-cache waiter
pkg/controllers/machinesetsync/machineset_sync_controller_test.go
Adds eventuallyManagerInformerCache(obj client.Object) gomegatypes.AsyncAssertion that polls mgr.GetClient().Get(...) until the manager’s informer cache reflects expected object state (used to wait for ObservedGeneration/template label observation).
afterMAPIUpdate setup hook
pkg/controllers/machinesetsync/machineset_sync_controller_test.go
Introduces afterMAPIUpdate func() and invokes it after updating mapiMachineSet.Status.AuthoritativeAPI and Status.ObservedGeneration so deferred test setup runs only once the MAPI status update is visible.
Informer-cache sync usages (CAPI & CAPA)
pkg/controllers/machinesetsync/machineset_sync_controller_test.go
Adds Eventually waits against the manager client in multiple scenarios to ensure capiMachineSet.Status.ObservedGeneration and CAPA machine template label changes are observed before continuing assertions or triggering reconciles.
Deletion & finalizer flow refactor
pkg/controllers/machinesetsync/machineset_sync_controller_test.go
Refactors CAPI deletion tests to establish finalizer preconditions via afterMAPIUpdate; performs deletion in JustBeforeEach; simulates CAPI controller finalizer removal by filtering finalizers with slices.DeleteFunc during cleanup checks.
Test wiring: centralize create calls
pkg/controllers/machinesetsync/machineset_vap_test.go
Replaces direct k8sClient.Create(...) and inline retry logic with Eventually(kCreate(...)) for namespace, AWSCluster/Cluster, templates, MachineSets, and sentinel objects; passes client.DryRunAll where applicable.
Imports / small cleanup
pkg/controllers/machinesetsync/machineset_sync_controller_test.go, pkg/controllers/machinesetsync/machineset_vap_test.go
Adds slices and gomegatypes imports; removes inline AlreadyExists retry scaffolding in favor of kCreate usage.

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant APIServer as Kubernetes API Server
    participant Manager as ctrl-runtime Manager (informer cache)
    participant Controller as MachineSetSync Controller

    Test->>APIServer: create/update MAPI MachineSet (Status.AuthoritativeAPI, ObservedGeneration)
    Test->>Manager: Eventually(managerInformerCache) wait for observed state
    Manager->>Controller: informer delivers updated object
    Controller->>APIServer: read CAPI MachineSet / CAPA template (reconcile)
    Controller->>APIServer: update/delete resources / finalizers
    Test->>APIServer: simulate CAPI finalizer removal (slices.DeleteFunc)
    Test->>Manager: wait for manager cache to observe finalizer removal
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

approved, lgtm, verified

Suggested reviewers

  • RadekManak
🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'NO-JIRA: Fix races in machineset sync units' accurately describes the main objective of the PR, which is to fix race conditions in the machineset sync controller tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in both modified test files are stable and deterministic. Test titles use only static strings with no dynamic values, UUIDs, timestamps, or generated identifiers.
Test Structure And Quality ✅ Passed PR meets test quality requirements: proper cleanup via BeforeEach/AfterEach, timeouts on Eventually calls, eventuallyManagerInformerCache races helper, and single-responsibility test structure.
Microshift Test Compatibility ✅ Passed Tests are unit tests using envtest (local test environment), not e2e tests. They run in isolation, not against a real cluster, so MicroShift compatibility requirements do not apply.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR modifies existing unit tests (not e2e) using envtest. No new Ginkgo e2e tests are added, so SNO check doesn't apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only test files (*_test.go) with no changes to deployment manifests or operator code. No scheduling constraints introduced. Fixes test races and infrastructure only.
Ote Binary Stdout Contract ✅ Passed No stdout writes in modified test files. suite_test.go (unmodified) correctly configures klog.SetOutput(GinkgoWriter) in BeforeSuite. All top-level code follows proper Ginkgo v2 patterns.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR modifies controller unit tests in pkg/controllers/machinesetsync/, not e2e tests. The custom check applies only to "new Ginkgo e2e tests." No IPv4 assumptions or external connectivity found.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from damdo and racheljpg May 7, 2026 16:50
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign damdo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mdbooth mdbooth force-pushed the machinesetsync-races branch 2 times, most recently from ee8c828 to eb108fa Compare May 8, 2026 09:58
@mdbooth
Copy link
Copy Markdown
Contributor Author

mdbooth commented May 8, 2026

/test e2e-aws-capi-techpreview

})

Context("when the CAPI machine set has a non-zero deletion timestamp", func() {
Context("when the CAPI machine set is deleted", func() {
Copy link
Copy Markdown
Contributor

@theobarberbany theobarberbany May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why the rename? lol

is non-zero deletion timestamp not more specific to what's going on here?

mapiNamespace = corev1resourcebuilder.Namespace().
WithGenerateName("openshift-machine-api-").Build()
Eventually(k8sClient.Create(ctx, mapiNamespace)).Should(Succeed(), "mapi namespace should be able to be created")
Eventually(kCreate(ctx, mapiNamespace)).Should(Succeed(), "mapi namespace should be able to be created")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 nice on catching these

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably wouldn't have spotted them if they hadn't flaked in a test run, tbh. Not the first time I've made this mistake, either. It's a tricky api.

capiMachineSet.Spec.Template.Spec.Version = testVersion

Eventually(k8sClient.Create(ctx, capiMachineSet), timeout).Should(MatchError(ContainSubstring(".version is a forbidden field")))
Eventually(kCreate(ctx, capiMachineSet, client.DryRunAll), timeout).Should(MatchError(ContainSubstring(".version is a forbidden field")))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice - this explains a flake I sometimes saw! 🎉

We use this same pattern in the machine sync VAP tests, could we change/ add it there too?

Context("when the MAPI machine set has MachineAuthority set to Cluster API", func() {
// afterMAPIUpdate can be set in a sub-context's BeforeEach to register
// initialisation to happen after the MAPI machine set is updated.
var afterMAPIUpdate func()
Copy link
Copy Markdown
Contributor

@theobarberbany theobarberbany May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern ... I get the intent but makes understanding what code is executing where even worse (layer before each, justbefore each, contexts) then trace this through?

BeforeEach -> Subcontext BeforeEach where this actually gets set -> JustBeforeEach where this gets called 😢

Feels like we're hand rolling a fix to our (probably my) mis-structuring of tests :/ Can we break up the higher contexts so we avoid this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very sympathetic to what you're saying. I was thinking the same. However, as you say the issue is structural: it's the nature of these deeply-nested tests. Fixing this structurally would amount to refactoring the entire test suite. I'm sympathetic to that, too, but it's not likely to happen in the near future.

This was honestly the simplest solution I could come up with without rewriting everything. It at least has a clear name, so it's explicit. If you can think of anything simpler I'd be in favour.

One change I'd be tempted to make to this, though, is to add a comment to explain this was added under duress, and should not be copy-pasta'd or extended. New tests should be structured so this is not necessary. Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... fair. It would be ugly and you're right.. wont be worked any time soon.

One change I'd be tempted to make to this, though, is to add a comment to explain this was added under duress, and should not be copy-pasta'd or extended. New tests should be structured so this is not necessary. Thoughts?

I think this would be a pragmatic middle ground.

I wonder if we can lint / catch a generalised version of this pattern somehow 🤔

if afterMAPIUpdate != nil {
afterMAPIUpdate()

// afterMAPIUpdate is not reinitialised between tests in this
Copy link
Copy Markdown
Contributor

@theobarberbany theobarberbany May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start this pattern I think someone / an agent will forget to unset it and we'll spend a day or two chasing our tails

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You worried that an agent might delete this line? Not following...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you addressed this in this comment: #551 (comment)

My above comment wasn't all that clear, I meant if this becomes a pattern that gets copied - but a big do not copy this comment should work 🤞

// behaviour of the controller in the presence of existing state. We need to
// ensure that the controller has had a chance to observe the state the test
// just created first.
eventuallyManagerInformerCache := func(obj client.Object) gomegatypes.AsyncAssertion {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good though :D

@mdbooth mdbooth force-pushed the machinesetsync-races branch from eb108fa to 16ba30f Compare May 12, 2026 10:56
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 14, 2026

@mdbooth: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants