fix: idms_filecycle race condition on PATH operations#4846

Closed
adalrsjr1 wants to merge 1 commit into main from asampaio/ARO-25884-fix

Conversation

@adalrsjr1 (Collaborator)

refer to ARO-25884

The idms_lifecycle.go E2E test fails because it sends PATCH requests without waiting for the cluster to return to a ready state. The cluster-service rejects these updates when the cluster is in the updating or pending_update state. This fix ensures the PATCH request only happens after the cluster has transitioned from updating to succeeded.

@openshift-ci

openshift-ci bot commented Apr 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adalrsjr1
Once this PR has been reviewed and has the lgtm label, please assign patriksuba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci bot commented Apr 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

}, 1*time.Minute, 15*time.Second).Should(Succeed(), "ImageDigestMirrorSet CRDs should exist on the hosted cluster")

// ensures the cluster is ready for updates before sending PATCH requests, eliminating the race
By("waiting for cluster to reach terminal provisioning state before update")
Collaborator:

The cluster should be in ProvisioningState.Succeeded at this point. The error noted is a clusters-service state, likely because the cluster control plane version is upgraded immediately after it's created.

  "error": {
        "code": "InvalidRequestContent",
        "message": "Cluster '2pighi4rnqotkalahsrjl39ckd3trbul' is in state 'pending_update', can't update"

The best workaround I can think of is to retry the PATCH until it succeeds, but the fact that this is happening at all might be a bug in the RP/CS. I don't think a control plane update should block a customer action like this. @bennerv wdyt?

Collaborator:

likely because the cluster control plane version is upgraded immediately after its created.
I don't think a control plane update should block a customer action like this.

Are you referring to CP z-stream upgrades? If so, they don't block and shouldn't block cluster operations. Neither the RP nor CS changes the state of the cluster when z-stream upgrades are initiated.

If this happened (e.g observed via a kusto / must-gather) then let's fix it.

That being said, a possible angle to look at as a likely cause: the e2e issued the PUT request and it was partially processed by the frontend (i.e. it sent the corresponding sync API call to CS, and CS processed it), but before the request was fully processed (the frontend or one of the components in the processing chain crashed, panicked, was OOM-killed, or hit an internal error), a 5xx was sent back to ARM. When the SDK received the 5xx, it retried the API call, and that second call got a 400 from CS since the previous update was already in flight. If this is indeed the cause, Kusto logs / must-gather can reveal the pattern. If so, the next steps would be:

  • which component failed and why; if it was due to memory or other infra setup like liveness probes, fix them
  • likely a long-term fix will also be to move the trigger of the CS API call to the RP backend (@mbarnes has this on his radar) while CS is still around

Collaborator:

If this was also a likely cause. Kusto logs / must-gather can reveal the pattern. And if so, find out the next steps:

These kusto logs show the pattern I was describing above:

31/03/2026, 15:26:16.444	request received	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster	api-version=2025-12-23-preview	
31/03/2026, 15:26:16.977	response complete	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster		500
31/03/2026, 15:26:17.943	request received	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster	api-version=2025-12-23-preview	
31/03/2026, 15:26:18.297	response complete	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster		400

@adalrsjr1 (Collaborator, Author) commented Apr 13, 2026

thx @tony-schndr and @machi1990 for the insights.

which component failed and why; if due to memory or any other infra setup like liveness probes, fix them

No infra failure. Cosmos DB transaction failed with 412 Precondition Failed:

  1. Frontend read cluster document from Cosmos (ETag: X)
  2. Backend controllers updated the same document during processing (ETag: X → Y)
  3. Frontend tried to write with stale ETag X → 412 Precondition Failed

Log entry (prow-j0316544-svc-aro-hcp-aro-hcp-frontend.jsonl):

{
  "time": "2026-03-31T15:26:16.9703825Z",
  "level": "ERROR",
  "msg": "server request error",
  "err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",
  "request_id": "14a1c69b-dcbd-4cd6-8546-0cdd41150c4e",
  "api_version": "2025-12-23-preview",
  "method": "patch"
}

HTTP Response:

{
  "response_status_code": 500,
  "error": {
    "code": "InternalServerError",
    "message": "Internal server error."
  }
}
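The three numbered steps above amount to optimistic concurrency with ETags. A toy model (not Cosmos DB's actual API) of why the stale write fails:

```go
package main

import "fmt"

type doc struct {
	etag string
	body string
}

type store struct{ d doc }

// replace succeeds only when the caller's ETag matches the stored
// one (the If-Match precondition); otherwise it returns 412.
// Each successful write bumps the ETag.
func (s *store) replace(ifMatch, body string) (int, error) {
	if ifMatch != s.d.etag {
		return 412, fmt.Errorf("412 Precondition Failed")
	}
	s.d = doc{etag: s.d.etag + "'", body: body}
	return 200, nil
}

func main() {
	s := &store{d: doc{etag: "X", body: "v1"}}

	// 1. frontend reads the cluster document (ETag X)
	frontendETag := s.d.etag

	// 2. backend controllers update the same document (ETag X -> X')
	s.replace(s.d.etag, "backend-update")

	// 3. frontend writes with the stale ETag X -> 412
	code, _ := s.replace(frontendETag, "frontend-update")
	fmt.Println(code) // 412
}
```

The frontend surfaces that 412 as the 500 InternalServerError seen in the HTTP response above.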

Collaborator:

@adalrsjr1 thanks. I don't see it in your comment above, but the

"err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",

is due to the ETag: the backend updated the resource and the frontend was operating on a stale copy.

I think there are a couple of things to look at here:

  • moving the CS call to the backend during update (I left a similar comment in the initial PR's review); @mbarnes has this on his radar
  • how we can handle retries in the frontend in case a similar pattern arises and we get a 412
    • I guess with the CS call moved to the backend, the frontend can retry safely by loading a fresh copy of the object
  • this can happen to other resources (node pool, external auth) as well, so we need a broader solution
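The "retry safely by loading the fresh copy" idea can be sketched as a read-mutate-replace loop over a toy document store; names and the simulated interference are illustrative, not the RP's actual types:

```go
package main

import "fmt"

type doc struct {
	etag string
	body string
}

type store struct{ d doc }

var interfered = false

// read returns the current document; the first read also simulates a
// backend controller bumping the document right after our read, which
// is what makes the first replace fail.
func (s *store) read() doc {
	d := s.d
	if !interfered {
		interfered = true
		s.d = doc{etag: s.d.etag + "'", body: "backend-update"}
	}
	return d
}

// replace succeeds only with a matching ETag (false models a 412).
func (s *store) replace(ifMatch, body string) bool {
	if ifMatch != s.d.etag {
		return false // 412 Precondition Failed
	}
	s.d = doc{etag: s.d.etag + "'", body: body}
	return true
}

// updateWithRetry re-reads the fresh document on a 412 and reapplies
// the mutation, bounded by maxAttempts.
func updateWithRetry(s *store, mutate func(doc) string, maxAttempts int) bool {
	for i := 0; i < maxAttempts; i++ {
		d := s.read()
		if s.replace(d.etag, mutate(d)) {
			return true
		}
	}
	return false
}

func main() {
	s := &store{d: doc{etag: "X", body: "v1"}}
	ok := updateWithRetry(s, func(d doc) string { return d.body + "+patch" }, 3)
	fmt.Println(ok, s.d.body) // true backend-update+patch
}
```

The key safety property is that the mutation is recomputed from the fresh copy on each attempt, so the retry never clobbers the backend's intervening update.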

Collaborator (Author):

If there are no objections, I’ll implement @tony-schndr’s suggestion to retry the update in the test. This serves as a hotfix for the intermittent issue while @mbarnes investigates the larger CS-to-Backend move.

Collaborator:

@adalrsjr1 the SDK already auto-retries; what you see here in #4846 (comment) is the typical retry behaviour when a 500 is received. So it depends on how you want to retry, as it will keep failing until the cluster moves back to a ready state in CS.

@adalrsjr1 (Collaborator, Author)

adalrsjr1 commented Apr 16, 2026

closing this PR in favor of ARO-24384

@adalrsjr1 adalrsjr1 closed this Apr 16, 2026
@adalrsjr1 adalrsjr1 deleted the asampaio/ARO-25884-fix branch April 16, 2026 15:46