GCP CAPI: bootstrap in master instance group causes worker ignition failure via ILB pinning #10590

@kaovilai

Bug

Worker nodes never join GCP CAPI-based OCP 4.22 clusters. Workers are provisioned (GCP VMs running) but stay stuck in an ignition fetch loop indefinitely, receiving HTTP 500 from the Machine Config Server (MCS).

Root Cause

In GCP CAPI installs, the installer places the bootstrap node in the same unmanaged instance group as master-0 (zone-a/b). Both are backends for the GCP Internal Load Balancer (ILB) on ports 6443 and 22623.

The GCP ILB uses connection-based session affinity (CONNECTION mode): once a worker's TCP connection is established to a backend, it stays pinned to that backend for the lifetime of the connection.

During the critical window when workers first boot and fetch ignition config from api-int:22623, the bootstrap node is the healthiest/first-responding backend. All worker connections get pinned to bootstrap.

The bootstrap MCS is designed to refuse worker ignition requests. It serves only master configs and logs the following for each worker request:

refusing to serve bootstrap configuration to pool "worker"

Workers receive HTTP 500 forever. The in-cluster MCS running on master-0 receives zero worker requests despite being healthy and ready.
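To check which kind of response the load-balanced endpoint returns, a probe along these lines may help. It is a sketch under assumptions: the /config/worker path and the Ignition Accept header reflect how the MCS is commonly queried and are not taken from this report, and API_INT_HOST is a hypothetical variable holding the cluster's api-int hostname. It needs to run from a host that can reach the ILB, and each probe opens a new connection, so a single result reflects only the backend chosen for that connection.

```shell
# Map an HTTP status from the MCS endpoint to a short diagnosis.
classify_mcs_status() {
  case "$1" in
    200) echo "served: a backend returned a worker config" ;;
    500) echo "refused: consistent with the bootstrap MCS answering" ;;
    000) echo "unreachable: no HTTP response" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Fetch only the HTTP status code for the worker config.
# $1 is the api-int hostname (assumption: port 22623, path /config/worker).
probe_mcs() {
  # -k skips verification because the MCS certificate is typically
  # signed by a cluster-internal CA.
  curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
    -H 'Accept: application/vnd.coreos.ignition+json;version=3.2.0' \
    "https://$1:22623/config/worker"
}

# Usage (API_INT_HOST is a placeholder):
# classify_mcs_status "$(probe_mcs "$API_INT_HOST")"
```

Repeating the probe a few times and comparing against the bootstrap MCS logs gives a rough picture of which backend is answering.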

Evidence

  • Worker serial console: 900+ ignition GET attempts, all returning Internal Server Error
  • Bootstrap MCS logs: explicit message, refusing to serve bootstrap configuration to pool "worker"
  • In-cluster MCS logs: zero worker requests after running for 40+ minutes
  • Zero worker CSRs in the cluster
  • gcloud compute instance-groups unmanaged list-instances confirms that bootstrap and master-0 are in the same group
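The last check can be scripted roughly as follows. This is a sketch, not the exact commands used for the report: check_group_members is a hypothetical helper, and it assumes the <cluster-id>-bootstrap and <cluster-id>-master-N naming seen in the workaround command. It reads the list-instances output on stdin, takes the first column, and strips any URL prefix.

```shell
# Read instance names (or instance URLs) on stdin, one per line, and report
# whether a bootstrap instance shares the group with at least one master.
# Runs in a subshell so its variables do not leak into the caller.
check_group_members() (
  # Keep the first column and drop any leading URL path.
  names="$(awk '{print $1}' | sed 's#.*/##')"
  has_bootstrap=no
  has_master=no
  if printf '%s\n' "$names" | grep -Eq -- '-bootstrap$'; then
    has_bootstrap=yes
  fi
  if printf '%s\n' "$names" | grep -Eq -- '-master-[0-9]+$'; then
    has_master=yes
  fi
  if [ "$has_bootstrap" = yes ] && [ "$has_master" = yes ]; then
    echo "shared"
  elif [ "$has_bootstrap" = yes ]; then
    echo "bootstrap-only"
  else
    echo "no-bootstrap"
  fi
)

# Usage (placeholders as in the workaround command):
# gcloud compute instance-groups unmanaged list-instances \
#   <cluster-id>-master-<zone> --zone=<zone> --project=<project> \
#   | check_group_members
```

A result of "shared" matches the condition described in this issue.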

Workaround

Remove bootstrap from the master instance group after masters are ready:

gcloud compute instance-groups unmanaged remove-instances \
  <cluster-id>-master-<zone> \
  --zone=<zone> \
  --instances=<cluster-id>-bootstrap \
  --project=<project>

After applying this, all 3 workers completed ignition within minutes and joined the cluster.
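For repeated cluster attempts, the command above can be wrapped in a small helper. This is a sketch: remove_bootstrap_from_group and DRY_RUN are hypothetical names, and the instance group name assumes the <cluster-id>-master-<zone> pattern shown above. By default it only prints the command so it can be reviewed first.

```shell
# Print (default) or run the remove-instances call for one zone.
# Arguments: cluster-id, zone, project.
# Set DRY_RUN=0 to actually execute the gcloud command.
remove_bootstrap_from_group() (
  cluster_id="$1"
  zone="$2"
  project="$3"
  # Build the command as positional parameters to keep quoting intact.
  set -- gcloud compute instance-groups unmanaged remove-instances \
    "${cluster_id}-master-${zone}" \
    "--zone=${zone}" \
    "--instances=${cluster_id}-bootstrap" \
    "--project=${project}"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
)

# Usage (values are placeholders):
# remove_bootstrap_from_group <cluster-id> <zone> <project>
# DRY_RUN=0 remove_bootstrap_from_group <cluster-id> <zone> <project>
```

As with the manual command, this should only be run once the masters are ready and the in-cluster MCS is serving.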

Suggested Fix

The installer's GCP CAPI code should either:

  1. Give bootstrap its own separate instance group (as Terraform/UPI installs did), or
  2. Remove bootstrap from the master instance group once masters are healthy and in-cluster MCS is serving

Environment

  • OCP: 4.22.0-rc.5 (also reproduced on 4.22.0-rc.4)
  • Platform: GCP (IPI with CAPI, Workload Identity Federation)
  • Region: us-east1
  • Confirmed across 5+ cluster creation attempts

Impact

This affects all GCP CAPI installs in OCP 4.22, not just WIF/STS. CAPI became the default for GCP IPI in 4.22.

Note

Responses generated with Claude
