fix: idms_filecycle race condition on PATH operations#4846

Closed
adalrsjr1 wants to merge 1 commit into main from asampaio/ARO-25884-fix

Conversation

@adalrsjr1 (Collaborator)

refer to ARO-25884

The idms_lifecycle.go E2E test fails because it sends PATCH requests without waiting for the cluster to return to a ready state. The cluster-service rejects these updates when the cluster is in the updating or pending_update state. This fix ensures the PATCH request only happens after the cluster has transitioned from updating to succeeded.

@openshift-ci

openshift-ci bot commented Apr 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adalrsjr1
Once this PR has been reviewed and has the lgtm label, please assign patriksuba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci bot commented Apr 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

}, 1*time.Minute, 15*time.Second).Should(Succeed(), "ImageDigestMirrorSet CRDs should exist on the hosted cluster")

// ensures the cluster is ready for updates before sending PATCH requests, eliminating the race
By("waiting for cluster to reach terminal provisioning state before update")
Collaborator:

The cluster should be in ProvisioningState.Succeeded at this point. The error noted is a clusters-service state, likely because the cluster control plane version is upgraded immediately after it's created.

  "error": {
        "code": "InvalidRequestContent",
        "message": "Cluster '2pighi4rnqotkalahsrjl39ckd3trbul' is in state 'pending_update', can't update"

The best workaround I can think of is to retry the PATCH until it succeeds, but the fact that this is happening at all might be a bug in the RP/CS. I don't think a control plane update should block a customer action like this. @bennerv wdyt?

Collaborator:

likely because the cluster control plane version is upgraded immediately after its created.
I don't think a control plane update should block a customer action like this.

Are you referring to CP z-stream upgrades? If so, they don't block and shouldn't block cluster operations. Neither the RP nor CS changes the state of the cluster when z-stream upgrades are initiated.

If this happened (e.g observed via a kusto / must-gather) then let's fix it.

That being said, a possible angle to look at as a likely cause: the e2e issued the PUT request and it was partially processed by the frontend (i.e. it sent the corresponding sync API call to CS, and CS processed it), but before the request was fully processed (the frontend or one of the components in the processing chain crashed, panicked, was OOM-killed, or hit an internal error), a 5xx was sent back to ARM. When the SDK received the 5xx, it retried the API call, and that second call got a 400 from CS since the previous update was already in flight. If this is indeed the cause, Kusto logs / must-gather can reveal the pattern. If so, the next steps would be:

  • which component failed and why; if it was due to memory or other infra setup like liveness probes, fix them
  • likely a long-term fix will also be to move the trigger of the CS API call to the RP backend (@mbarnes has this on his radar) while CS is still around

Collaborator:

If this was also a likely cause. Kusto logs / must-gather can reveal the pattern. And if so, find out the next steps:

These kusto logs show the pattern I was describing above:

31/03/2026, 15:26:16.444	request received	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster	api-version=2025-12-23-preview	
31/03/2026, 15:26:16.977	response complete	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster		500
31/03/2026, 15:26:17.943	request received	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster	api-version=2025-12-23-preview	
31/03/2026, 15:26:18.297	response complete	patch	/subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster		400

@adalrsjr1 (Collaborator, Author) commented Apr 13, 2026

thx @tony-schndr and @machi1990 for the insights.

which component failed and why; if due to memory or any other infra setup like liveness probes, fix them

No infra failure. Cosmos DB transaction failed with 412 Precondition Failed:

  1. Frontend read cluster document from Cosmos (ETag: X)
  2. Backend controllers updated the same document during processing (ETag: X → Y)
  3. Frontend tried to write with stale ETag X → 412 Precondition Failed

Log entry (prow-j0316544-svc-aro-hcp-aro-hcp-frontend.jsonl):

{
  "time": "2026-03-31T15:26:16.9703825Z",
  "level": "ERROR",
  "msg": "server request error",
  "err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",
  "request_id": "14a1c69b-dcbd-4cd6-8546-0cdd41150c4e",
  "api_version": "2025-12-23-preview",
  "method": "patch"
}

HTTP Response:

{
  "response_status_code": 500,
  "error": {
    "code": "InternalServerError",
    "message": "Internal server error."
  }
}
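The three numbered steps above amount to optimistic concurrency with ETags. A toy model (not Cosmos DB's actual API) of why the stale write fails:

```go
package main

import "fmt"

type doc struct {
	etag string
	body string
}

type store struct{ d doc }

// replace succeeds only when the caller's ETag matches the stored
// one (the If-Match precondition); otherwise it returns 412.
// Each successful write bumps the ETag.
func (s *store) replace(ifMatch, body string) (int, error) {
	if ifMatch != s.d.etag {
		return 412, fmt.Errorf("412 Precondition Failed")
	}
	s.d = doc{etag: s.d.etag + "'", body: body}
	return 200, nil
}

func main() {
	s := &store{d: doc{etag: "X", body: "v1"}}

	// 1. frontend reads the cluster document (ETag X)
	frontendETag := s.d.etag

	// 2. backend controllers update the same document (ETag X -> X')
	s.replace(s.d.etag, "backend-update")

	// 3. frontend writes with the stale ETag X -> 412
	code, _ := s.replace(frontendETag, "frontend-update")
	fmt.Println(code) // 412
}
```

The frontend surfaces that 412 as the 500 InternalServerError seen in the HTTP response above.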

Collaborator:

@adalrsjr1 thanks. I don't see it in your comment above, but the

"err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",

is due to the ETag: the backend updated the resource and the frontend was operating on a stale copy.

I think there are a couple of things to look at here:

  • moving the CS call to the backend during update (I left a similar comment in the initial PR's review); @mbarnes has this on his radar
  • how we can handle retries in the frontend in case a similar pattern arises and we get a 412
    • I guess with the CS call moved to the backend, the frontend can retry safely by loading a fresh copy of the object
  • this can happen to other resources (node pool, external auth) as well, so we need a broader solution
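The "retry safely by loading the fresh copy" idea can be sketched as a read-mutate-replace loop over a toy document store; names and the simulated interference are illustrative, not the RP's actual types:

```go
package main

import "fmt"

type doc struct {
	etag string
	body string
}

type store struct{ d doc }

var interfered = false

// read returns the current document; the first read also simulates a
// backend controller bumping the document right after our read, which
// is what makes the first replace fail.
func (s *store) read() doc {
	d := s.d
	if !interfered {
		interfered = true
		s.d = doc{etag: s.d.etag + "'", body: "backend-update"}
	}
	return d
}

// replace succeeds only with a matching ETag (false models a 412).
func (s *store) replace(ifMatch, body string) bool {
	if ifMatch != s.d.etag {
		return false // 412 Precondition Failed
	}
	s.d = doc{etag: s.d.etag + "'", body: body}
	return true
}

// updateWithRetry re-reads the fresh document on a 412 and reapplies
// the mutation, bounded by maxAttempts.
func updateWithRetry(s *store, mutate func(doc) string, maxAttempts int) bool {
	for i := 0; i < maxAttempts; i++ {
		d := s.read()
		if s.replace(d.etag, mutate(d)) {
			return true
		}
	}
	return false
}

func main() {
	s := &store{d: doc{etag: "X", body: "v1"}}
	ok := updateWithRetry(s, func(d doc) string { return d.body + "+patch" }, 3)
	fmt.Println(ok, s.d.body) // true backend-update+patch
}
```

The key safety property is that the mutation is recomputed from the fresh copy on each attempt, so the retry never clobbers the backend's intervening update.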

Collaborator (Author):

If there are no objections, I’ll implement @tony-schndr’s suggestion to retry the update in the test. This serves as a hotfix for the intermittent issue while @mbarnes investigates the larger CS-to-Backend move.

Collaborator:

@adalrsjr1 the SDK already auto-retries; what you see here in #4846 (comment) is the typical retry behaviour when a 500 is received. So it depends on how you want to retry, as it will keep failing until the cluster moves back to a ready state in CS.

@adalrsjr1 (Collaborator, Author)

adalrsjr1 commented Apr 16, 2026

closing this PR in favor of ARO-24384

@adalrsjr1 adalrsjr1 closed this Apr 16, 2026
@adalrsjr1 adalrsjr1 deleted the asampaio/ARO-25884-fix branch April 16, 2026 15:46