GCP CAPI: bootstrap in master instance group causes worker ignition failure via ILB pinning

## Bug

Worker nodes never join GCP CAPI-based OCP 4.22 clusters. Workers are provisioned (GCP VMs running) but stay stuck in an ignition fetch loop indefinitely, receiving HTTP 500 from the Machine Config Server (MCS).

## Root Cause

In GCP CAPI installs, the installer places the bootstrap node in the **same unmanaged instance group** as master-0 (zone-a/b). Both are backends for the GCP Internal Load Balancer (ILB) on ports 6443 and 22623.

The GCP ILB uses **connection-based session affinity** (`CONNECTION` mode): once a worker's TCP connection to a backend is established, it stays pinned to that backend for the lifetime of the connection.

During the critical window when workers first boot and fetch ignition config from `api-int:22623`, the bootstrap node is the healthiest/first-responding backend. All worker connections get pinned to bootstrap.

Bootstrap MCS is designed to **refuse worker ignition requests** and serves only master configs:

```
refusing to serve bootstrap configuration to pool "worker"
```

Workers receive HTTP 500 indefinitely. The in-cluster MCS running on master-0 receives **zero worker requests** despite being healthy and ready.
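The behaviour can be checked by hand from a VM inside the cluster VPC by requesting the worker config the same way a booting worker would. The following is a minimal sketch, not part of the original report. It assumes that MCS exposes the config at `/config/<pool>` on port 22623 and that the `Accept` header shown is acceptable to it. The function names and the status labels are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map an HTTP status code from MCS to a short label.
interpret_mcs_status() {
  case "$1" in
    200) echo "served" ;;            # config was returned
    500) echo "refused-or-error" ;;  # what the stuck workers see
    000) echo "unreachable" ;;       # curl could not complete the request
    *)   echo "unexpected:$1" ;;
  esac
}

# Hypothetical helper: request the ignition config for a pool and classify
# the result. $1 = host (for example the api-int name), $2 = pool name.
# Assumption: /config/<pool> path and the Accept header value.
probe_mcs() {
  local host="$1" pool="$2" code
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
    -H 'Accept: application/vnd.coreos.ignition+json;version=3.2.0' \
    "https://${host}:22623/config/${pool}")
  interpret_mcs_status "$code"
}

# Example (run from inside the VPC; replace the placeholder host):
# probe_mcs api-int.<cluster-domain> worker
```

A single probe only reports the backend that the ILB happened to select for that connection, so repeating it several times gives a better picture than one request.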

## Evidence

- Worker serial console: 900+ ignition GET attempts, all returning `Internal Server Error`
- Bootstrap MCS logs: explicit `refusing to serve bootstrap configuration to pool "worker"` 
- In-cluster MCS logs: zero worker requests despite running 40+ minutes
- Zero worker CSRs in the cluster
- `gcloud compute instance-groups unmanaged list-instances` confirms bootstrap and master-0 in same group
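The instance group check in the last bullet can be scripted. This is a rough sketch under stated assumptions: instance names follow the `<cluster-id>-bootstrap` and `<cluster-id>-master-<n>` patterns, and the `--format` expression in the usage comment prints one instance name per line. The helper name `shares_group_with_bootstrap` is hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads instance names (one per line) on stdin and
# reports whether a bootstrap instance sits in the same group as a master.
# Assumption: names end in "-bootstrap" or contain "-master-".
shares_group_with_bootstrap() {
  local name has_bootstrap=0 has_master=0
  while IFS= read -r name; do
    case "$name" in
      *-bootstrap) has_bootstrap=1 ;;
      *-master-*)  has_master=1 ;;
    esac
  done
  if [ "$has_bootstrap" -eq 1 ] && [ "$has_master" -eq 1 ]; then
    echo "bootstrap shares group with master"
  else
    echo "ok"
  fi
}

# Usage (placeholders must be filled in before running):
# gcloud compute instance-groups unmanaged list-instances <cluster-id>-master-<zone> \
#   --zone=<zone> --project=<project> \
#   --format='value(instance.basename())' | shares_group_with_bootstrap
```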

## Workaround

Remove bootstrap from the master instance group after masters are ready:

```bash
gcloud compute instance-groups unmanaged remove-instances \
  <cluster-id>-master-<zone> \
  --zone=<zone> \
  --instances=<cluster-id>-bootstrap \
  --project=<project>
```

After applying this, all 3 workers completed ignition within minutes and joined the cluster.

## Suggested Fix

The installer's GCP CAPI code should either:

1. **Give bootstrap its own separate instance group** (as Terraform/UPI installs did), or
2. **Remove bootstrap from the master instance group** once masters are healthy and in-cluster MCS is serving
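The sequencing of option 2 can be sketched in shell as "poll until the in-cluster MCS answers, then remove bootstrap from the group". This only illustrates the order of operations and is not the installer's implementation. It assumes MCS answers a `/healthz` request on port 22623, and all function names are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 as soon as the command succeeds, 1 otherwise.
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 0; i < attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Assumption: the in-cluster MCS on a master answers /healthz on 22623.
# $1 = master IP or hostname.
mcs_healthy() {
  curl -skf --max-time 5 -o /dev/null "https://$1:22623/healthz"
}

# Same command as the manual workaround, wrapped in a function.
# $1 = instance group, $2 = zone, $3 = bootstrap instance, $4 = project.
remove_bootstrap() {
  gcloud compute instance-groups unmanaged remove-instances "$1" \
    --zone="$2" --instances="$3" --project="$4"
}

# Example (fill in real values before running):
# wait_until 60 10 mcs_healthy "$MASTER_IP" &&
#   remove_bootstrap "$GROUP" "$ZONE" "$BOOTSTRAP" "$PROJECT"
```

Polling a master directly, rather than `api-int`, avoids the check itself being answered by bootstrap through the ILB.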

## Environment

- OCP: 4.22.0-rc.5 (also reproduced on 4.22.0-rc.4)
- Platform: GCP (IPI with CAPI, Workload Identity Federation)
- Region: us-east1
- Confirmed across 5+ cluster creation attempts

## Impact

This affects **all GCP CAPI installs** in OCP 4.22, not just WIF/STS. CAPI became the default for GCP IPI in 4.22.

> [!Note]
> Responses generated with Claude

GCP CAPI: bootstrap in master instance group causes worker ignition failure via ILB pinning #10590
