fix: idms_filecycle race condition on PATH operations#4846
Conversation
The E2E test fails because it sends PATCH requests without waiting for the cluster to return to a ready state. The cluster-service backend rejects these updates when the cluster is in 'updating' or 'pending_update' state. This fix ensures the PATCH request only happens when the cluster has transitioned from 'updating' to 'succeeded'.
Skipping CI for Draft Pull Request.
}, 1*time.Minute, 15*time.Second).Should(Succeed(), "ImageDigestMirrorSet CRDs should exist on the hosted cluster")

// ensures the cluster is ready for updates before sending PATCH requests, eliminating the race
By("waiting for cluster to reach terminal provisioning state before update")
The cluster should be in ProvisioningState.Succeeded at this point. The error refers to a clusters-service state, likely because the cluster control plane version is upgraded immediately after it's created.
"error": {
  "code": "InvalidRequestContent",
  "message": "Cluster '2pighi4rnqotkalahsrjl39ckd3trbul' is in state 'pending_update', can't update"
}
The best workaround I can think of is to retry the PATCH until it succeeds, but the fact this is happening might be a bug in the RP/CS. I don't think a control plane update should block a customer action like this. @bennerv wdyt?
likely because the cluster control plane version is upgraded immediately after it's created.
I don't think a control plane update should block a customer action like this.
Are you referring to CP z-stream upgrades? If so, they don't block and they shouldn't block cluster operations. Neither the RP nor CS changes the state of the cluster when z-stream upgrades are initiated.
If this happened (e.g. observed via Kusto / must-gather) then let's fix it.
That being said, a possible angle to look at as a likely cause: the e2e issued the PUT request and it was partially processed by the frontend (i.e. it sent the corresponding sync API call to CS, and CS processed it), but before the request was fully processed by the frontend, the frontend or one of the components in the processing chain crashed, panicked, was OOM-killed, or hit some internal error, which resulted in a 5xx being sent back to ARM. When the SDK received the 5xx, it retried the API call with a second request, which got a 400 from CS since the previous update was already in flight. Kusto logs / must-gather can reveal whether this was the pattern. If so, the next steps:
- which component failed and why; if due to memory or any other infra setup like liveness probes, fix them
- likely a long-term fix will be to also move the trigger of the CS API call to the RP backend (@mbarnes has this on his radar) while CS is still around.
If this was also a likely cause, Kusto logs / must-gather can reveal the pattern. And if so, find out the next steps:
These Kusto logs show the pattern I was describing above:
31/03/2026, 15:26:16.444 request received patch /subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster api-version=2025-12-23-preview
31/03/2026, 15:26:16.977 response complete patch /subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster 500
31/03/2026, 15:26:17.943 request received patch /subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster api-version=2025-12-23-preview
31/03/2026, 15:26:18.297 response complete patch /subscriptions/974ebd46-8ad3-41e3-afef-7ef25fd5c371/resourcegroups/idms-75hl98/providers/microsoft.redhatopenshift/hcpopenshiftclusters/idms-e2e-hcp-cluster 400
thx @tony-schndr and @machi1990 for the insights.
which component failed and why; if due to memory or any other infra setup like liveness probes, fix them
No infra failure. Cosmos DB transaction failed with 412 Precondition Failed:
- Frontend read cluster document from Cosmos (ETag: X)
- Backend controllers updated the same document during processing (ETag: X → Y)
- Frontend tried to write with stale ETag X → 412 Precondition Failed
Log entry (prow-j0316544-svc-aro-hcp-aro-hcp-frontend.jsonl):
{
"time": "2026-03-31T15:26:16.9703825Z",
"level": "ERROR",
"msg": "server request error",
"err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",
"request_id": "14a1c69b-dcbd-4cd6-8546-0cdd41150c4e",
"api_version": "2025-12-23-preview",
"method": "patch"
}
HTTP Response:
{
"response_status_code": 500,
"error": {
"code": "InternalServerError",
"message": "Internal server error."
}
}
@adalrsjr1 thanks. I don't see it in your comment above, but the
"err": "(wrapped at cluster.go:684) transaction step 2 of 2 failed with 412 Precondition Failed",
is due to the ETag: the backend updated the resource and the frontend was operating on a stale copy.
I think there are a couple of things to look at here:
- moving the CS call to the backend during update (I left a similar comment in the initial PR's review); @Matthew Barnes has this on his radar
- how we can handle retries in the frontend in case a similar pattern arises and we get a 412
  - I guess with the CS call moved to the backend, the frontend can retry safely by loading a fresh copy of the object
- this can happen to other resources (nodepool, external auth) as well, so we need a broader solution
If there are no objections, I’ll implement @tony-schndr’s suggestion to retry the update in the test. This serves as a hotfix for the intermittent issue while @mbarnes investigates the larger CS-to-Backend move.
@adalrsjr1 the SDK already auto-retries; what you see here #4846 (comment) is the typical retry behaviour when a 500 is received. So it depends on how you want to retry, as it'll keep failing until the cluster moves back to the ready state in CS.
closing this PR in favor of ARO-24384
refer to ARO-25884
The idms_lifecycle.go E2E test fails because it sends PATCH requests without waiting for the cluster to return to a ready state. The cluster-service rejects these updates when the cluster is in 'updating' or 'pending_update' state. This fix ensures the PATCH request only happens when the cluster has transitioned from 'updating' to 'succeeded'.