GCP destroy: instance group deletion fails due to dependency ordering with backend services #10584

@kaovilai

Bug

openshift-install destroy cluster on GCP fails to delete instance groups because the backend service that references them has not been deleted yet. The destroy loop keeps retrying, and although the dependency usually clears after enough attempts, a destroy that times out first leaves orphaned resources.

Error

WARNING failed to delete instance group tkaovila-260601-wif-fwj68-master-us-east1-c: googleapi: Error 400:
  The instance_group resource 'projects/.../instanceGroups/...-master-us-east1-c' is already being used by
  'projects/.../backendServices/...-apiserver', resourceInUseByAnotherResource

This repeats for all 3 instance groups (one per zone), cycling many times before the backend service is eventually deleted.

Root Cause

In pkg/destroy/gcp/gcp.go lines 204-223, all resource types in stage 3 are destroyed in parallel:

{
    {name: "Instances", execute: o.destroyInstances},
    // ... other resources ...
    {name: "Instance groups", execute: o.destroyInstanceGroups},   // line 215
    {name: "Target TCP Proxies", execute: o.destroyTargetTCPProxies}, // line 216
    {name: "Backend services", execute: o.destroyBackendServices}, // line 217
    // ...
}

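GCP enforces a dependency chain: forwarding rules → target TCP proxies → backend services → instance groups. Each resource in the chain references the one after it, and a referenced resource cannot be deleted until its referrer is gone: a backend service cannot be deleted while a target TCP proxy or forwarding rule references it, and an instance group cannot be deleted while a backend service references it.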

Since all deletions run concurrently, instance group deletion is attempted before (or simultaneously with) backend service deletion. The GCP API returns 400 resourceInUseByAnotherResource. The retry loop does eventually resolve this (backend service succeeds on attempt N, instance groups succeed on attempt N+1), but it produces many warning logs and can leave orphaned resources if the destroy times out.
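
To make the retry behavior concrete, here is a minimal sketch that models the dependency chain in memory. It is not installer code and does not call the GCP API: fakeCloud, remove, and passesNeeded are hypothetical names, and the sequential "worst order" only stands in for one unlucky interleaving of the concurrent deleters.

```go
package main

import (
	"errors"
	"fmt"
)

// errInUse stands in for GCP's 400 resourceInUseByAnotherResource response.
var errInUse = errors.New("resourceInUseByAnotherResource")

// fakeCloud is a hypothetical in-memory model, not the GCP API.
// usedBy[x] names the resource that references x and so blocks its deletion.
type fakeCloud struct {
	usedBy map[string]string
	exists map[string]bool
}

func newFakeCloud() *fakeCloud {
	return &fakeCloud{
		usedBy: map[string]string{
			"target-tcp-proxy": "forwarding-rule",
			"backend-service":  "target-tcp-proxy",
			"instance-group":   "backend-service",
		},
		exists: map[string]bool{
			"forwarding-rule":  true,
			"target-tcp-proxy": true,
			"backend-service":  true,
			"instance-group":   true,
		},
	}
}

// remove fails while the referencing resource still exists.
func (c *fakeCloud) remove(name string) error {
	if ref, ok := c.usedBy[name]; ok && c.exists[ref] {
		return fmt.Errorf("%s is used by %s: %w", name, ref, errInUse)
	}
	c.exists[name] = false
	return nil
}

// passesNeeded attempts every deletion in the given order, pass after pass,
// and reports how many passes and failed attempts it took to delete everything.
// order must list each of the four modeled resources exactly once.
func passesNeeded(order []string) (passes, failures int) {
	c := newFakeCloud()
	remaining := len(order)
	for remaining > 0 {
		passes++
		for _, name := range order {
			if !c.exists[name] {
				continue // already deleted in an earlier pass
			}
			if err := c.remove(name); errors.Is(err, errInUse) {
				failures++
				continue
			}
			remaining--
		}
	}
	return passes, failures
}

func main() {
	worst := []string{"instance-group", "backend-service", "target-tcp-proxy", "forwarding-rule"}
	ordered := []string{"forwarding-rule", "target-tcp-proxy", "backend-service", "instance-group"}

	p, f := passesNeeded(worst)
	fmt.Printf("worst order: %d passes, %d failed attempts\n", p, f)
	p, f = passesNeeded(ordered)
	fmt.Printf("dependency order: %d passes, %d failed attempts\n", p, f)
}
```

In this model the worst order needs 4 passes with 6 failed attempts, while the dependency order finishes in 1 pass with none. Each failed attempt corresponds to one of the warning lines shown in the Error section.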

Suggested Fix

Split the load balancer resources into dependency-ordered stages:

// Stage 3a: Delete LB frontend (forwarding rules, target proxies)
{
    {name: "Forwarding rules", execute: o.destroyForwardingRules},
    {name: "Target Pools", execute: o.destroyTargetPools},
    {name: "Target TCP Proxies", execute: o.destroyTargetTCPProxies},
},
// Stage 3b: Delete LB backend
{
    {name: "Backend services", execute: o.destroyBackendServices},
    {name: "Health checks", execute: o.destroyHealthChecks},
    {name: "HTTP Health checks", execute: o.destroyHTTPHealthChecks},
},
// Stage 3c: Delete compute resources (instance groups now safe to delete)
{
    {name: "Instances", execute: o.destroyInstances},
    {name: "Disks", execute: o.destroyDisks},
    {name: "Instance groups", execute: o.destroyInstanceGroups},
    // ... remaining resources ...
},

This ensures backend services are fully deleted before instance group deletion is attempted, provided each stage runs to completion before the next one starts.
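
The ordering guarantee depends on stages running one after another while the items inside a stage run in parallel. The sketch below shows that execution model under stated assumptions: item, runStages, and recordOrder are hypothetical names, the deleters are no-ops that only record when they were called, and the installer's actual stage runner (including its retry handling) may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// item mirrors the {name, execute} shape used in the snippets above;
// the installer's real type may differ.
type item struct {
	name    string
	execute func() error
}

// runStages runs stages one after another. Items inside a stage run
// concurrently, and the next stage starts only once every item in the
// current stage has returned without error.
func runStages(stages [][]item) error {
	for i, stage := range stages {
		var (
			wg   sync.WaitGroup
			mu   sync.Mutex
			errs []error
		)
		for _, it := range stage {
			wg.Add(1)
			go func(it item) {
				defer wg.Done()
				if err := it.execute(); err != nil {
					mu.Lock()
					errs = append(errs, fmt.Errorf("%s: %w", it.name, err))
					mu.Unlock()
				}
			}(it)
		}
		wg.Wait() // barrier: nothing from the next stage has started yet
		if len(errs) > 0 {
			return fmt.Errorf("stage %d failed: %v", i, errs)
		}
	}
	return nil
}

// recordOrder runs a trimmed-down version of the proposed stages with
// no-op deleters and returns the order in which they were invoked.
func recordOrder() ([]string, error) {
	var (
		mu    sync.Mutex
		order []string
	)
	step := func(name string) item {
		return item{name: name, execute: func() error {
			mu.Lock()
			defer mu.Unlock()
			order = append(order, name)
			return nil
		}}
	}
	err := runStages([][]item{
		{step("Forwarding rules"), step("Target TCP Proxies")},
		{step("Backend services"), step("Health checks")},
		{step("Instances"), step("Instance groups")},
	})
	return order, err
}

func main() {
	order, err := recordOrder()
	fmt.Println(order, err)
}
```

The order within a stage varies from run to run, but "Backend services" always appears before "Instance groups" because of the wg.Wait barrier between stages. This sketch stops at the first failed stage; a runner that instead retries a stage until it succeeds would preserve the same guarantee.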

Environment

  • openshift-install version: 4.22.0-ec.5 (commit f8a1613fc01e0b5de8ee9906e8aa067ba2fdb98e)
  • Platform: GCP (Workload Identity Federation)
  • Region: us-east1

Orphaned Resources After Destroy

After destroy timed out, these resources remained:

$ gcloud compute backend-services list --filter='name~cluster-name'
tkaovila-260601-wif-fwj68-apiserver

$ gcloud compute instance-groups list --filter='name~cluster-name'
tkaovila-260601-wif-fwj68-master-us-east1-b
tkaovila-260601-wif-fwj68-master-us-east1-c
tkaovila-260601-wif-fwj68-master-us-east1-d

Note

Responses generated with Claude
