[Detail Bug] Control plane bootstrap: ServiceCIDR creation failures are silently treated as success, preventing retry #20

@detail-app

Description

Summary

  • Context: The ServiceCIDRBootstrapper creates a default ServiceCIDR object in each virtual control plane to enable service IP allocation.
  • Bug: The bootstrap() method always returns nil, ignoring context timeout and making it impossible to detect when ServiceCIDR creation failed.
  • Expected vs. actual: The method should return an error when the context times out before the ServiceCIDR is created, allowing a retry on a subsequent request. Instead, it marks the cluster as "done" regardless of success.
  • Impact: If ServiceCIDR creation takes longer than the 10-second timeout or fails, the cluster will be permanently marked as bootstrapped without a ServiceCIDR, breaking service IP allocation for that cluster.

Code with bug

// servicecidr.go:84-94
func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
    client, err := b.clientForCluster(clusterID)
    if err != nil {
        return err
    }
    ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    ctrl.Start(ctx)
    return nil  // <-- BUG: Always returns nil, even on timeout
}

Evidence

Evidence 1: Upstream controller Start() implementation (exact source)

From k8s.io/kubernetes/pkg/controlplane/controller/defaultservicecidr/default_servicecidr_controller.go:

// Start will not return until the default ServiceCIDR exists or stopCh is closed.
func (c *Controller) Start(ctx context.Context) {
    defer utilruntime.HandleCrash()
    stopCh := ctx.Done()  // <-- Gets channel that closes when context is done

    // ... event broadcaster setup ...

    go c.serviceCIDRInformer.Run(stopCh)
    if !cache.WaitForNamedCacheSync(controllerName, stopCh, c.serviceCIDRsSynced) {
        return  // <-- Returns early if context done before cache syncs
    }

    // wait until first successfully sync
    // this blocks apiserver startup so poll with a short interval
    err := wait.PollUntilContextCancel(ctx, 100*time.Millisecond, true, func(ctx context.Context) (bool, error) {
        syncErr := c.sync()
        return syncErr == nil, nil
    })
    if err != nil {
        klog.Infof("error initializing the default ServiceCIDR: %v", err)
        // <-- CONTINUES EXECUTION, does not return
    }

    // run the sync loop in the background with the defined interval
    go wait.Until(func() {
        err := c.sync()
        if err != nil {
            klog.Infof("error trying to sync the default ServiceCIDR: %v", err)
        }
    }, c.interval, stopCh)  // <-- Uses stopCh (ctx.Done())
}

Source: github.com/kplane-dev/kubernetes at commit 20260404055358-aac72e7d04ad, used via go.mod replace directive at line 53.

Why wait.Until never runs the sync function after timeout

From k8s.io/apimachinery/pkg/util/wait/backoff.go:

func Until(f func(), period time.Duration, stopCh <-chan struct{}) {
    // delegates via JitterUntil; equivalent to:
    BackoffUntil(f, NewJitteredBackoffManager(period, 0.0, &clock.RealClock{}), true, stopCh)
}

func BackoffUntil(f func(), backoff BackoffManager, sliding bool, stopCh <-chan struct{}) {
    var t clock.Timer
    for {
        select {
        case <-stopCh:  // <-- CHECKS STOPCH FIRST
            return      // <-- EXITS IMMEDIATELY if already closed
        default:
        }
        // ... timer logic and f() call happen AFTER this check ...
    }
}

When timeout fires, the background goroutine exits without ever running sync():

  1. wait.PollUntilContextCancel times out after 10 seconds
  2. Start() continues and spawns go wait.Until(func() { c.sync() }, c.interval, stopCh)
  3. stopCh is ctx.Done(), which is already closed (timeout fired)
  4. BackoffUntil enters loop, select { case <-stopCh: return } matches immediately
  5. Goroutine exits without ever calling sync()
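A minimal standalone program makes steps 4-5 concrete: if the stop channel is already closed when wait.Until is called, the function argument is never invoked. (A sketch only; it uses the real k8s.io/apimachinery/pkg/util/wait package, with timings shortened for the demo.)

package main

import (
    "context"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    // Stand-in for the bootstrap context after its 10s timeout has fired.
    ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
    defer cancel()
    <-ctx.Done() // the deadline has passed; ctx.Done() is now closed

    ran := false
    // Same shape as: go wait.Until(func() { c.sync() }, c.interval, stopCh)
    wait.Until(func() { ran = true }, time.Second, ctx.Done())

    // wait.Until returned immediately: BackoffUntil's select saw the closed
    // stopCh before ever scheduling the first call to f.
    fmt.Println("sync ran:", ran) // prints "sync ran: false"
}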

Evidence 2: ctx.Err() correctly distinguishes success from failure

The reviewer asks how we can distinguish success from failure. The key is understanding when ctx.Err() becomes non-nil.

| Scenario | wait.PollUntilContextCancel returns | ctx.Err() after poll returns | Background goroutine behavior |
| --- | --- | --- | --- |
| Sync succeeds before timeout | nil (condition returned true) | nil (context not cancelled) | Runs every 10 seconds |
| Timeout fires before sync succeeds | context.DeadlineExceeded | context.DeadlineExceeded | Exits immediately (see Evidence 1) |

Critical insight: defer cancel() in bootstrap() has NOT run when Start() returns. The context is only cancelled by the timeout, not by the defer. So:

  • ctx.Err() == nil after Start() returns → sync succeeded
  • ctx.Err() != nil after Start() returns → sync failed due to timeout
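A short standalone sketch of the distinction (timings shortened for the demo):

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    // While the context is live (the "sync succeeded before timeout" case):
    fmt.Println(ctx.Err()) // <nil>

    // After the deadline fires (the "timeout before sync succeeded" case):
    time.Sleep(100 * time.Millisecond)
    fmt.Println(ctx.Err()) // context deadline exceeded

    // Note that cancel() has not run yet, mirroring bootstrap(): the deferred
    // cancel executes only after Start() returns and ctx.Err() is inspected.
}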

Evidence 3: Why sibling bootstrappers don't need ctx.Err() check

The reviewer asks why systemnamespaces.go and rbac.go don't check ctx.Err(). The answer: they call functions that return errors.

systemnamespaces.go:81-108:

func (b *SystemNamespaceBootstrapper) bootstrap(clusterID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()

    // (client acquisition elided from this excerpt)
    for _, ns := range b.namespaces {
        _, err := client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{...}, metav1.CreateOptions{})
        if err != nil && !apierrors.IsAlreadyExists(err) {
            return err  // <-- API call returns error on timeout
        }
    }
    return nil
}

The API client returns context.DeadlineExceeded when the context times out. The error is propagated directly.
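This is not special to client-go: any context-aware client surfaces the context error through the call's returned error. A self-contained sketch using net/http, with a deliberately unresponsive test server standing in for a slow apiserver or etcd:

package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

func main() {
    // A handler that never answers, standing in for a slow backend.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        <-r.Context().Done()
    }))
    defer srv.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
    _, err := http.DefaultClient.Do(req)

    // The context deadline surfaces in the returned error, so the caller can
    // propagate it, which is exactly what systemnamespaces.go relies on.
    fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}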

rbac.go:90-116:

func (b *RBACBootstrapper) bootstrap(clusterID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    // hook is the RBAC post-start hook (acquisition elided from this excerpt)
    return hook(genericapiserver.PostStartHookContext{
        Context: ctx,
    })  // <-- hook() returns error on timeout
}

servicecidr.go is unique because ctrl.Start(ctx) returns nothing: there is no error value to propagate. It is the only bootstrapper that calls a fire-and-forget function. The controller's Start() method was designed for apiserver startup, where:

  1. The process continues running indefinitely
  2. Errors are logged, not propagated
  3. There's no retry mechanism needed

The bootstrapper misuses this controller by:

  1. Applying a short timeout (10 seconds)
  2. Expecting error propagation (which Start() doesn't support)
  3. Implementing a retry mechanism (which Start() doesn't participate in)

Evidence 4: The inflight blocking mechanism doesn't compensate for the bug

The reviewer correctly notes that inflight stores chan struct{} values and that concurrent requests block and wait rather than returning early.

servicecidr.go:52-82:

func (b *ServiceCIDRBootstrapper) Ensure(clusterID string) {
    b.mu.Lock()
    if _, ok := b.done[clusterID]; ok {
        b.mu.Unlock()
        return
    }
    if ch, ok := b.inflight[clusterID]; ok {
        b.mu.Unlock()
        <-ch  // <-- BLOCKS until channel closed
        return
    }
    ch := make(chan struct{})
    b.inflight[clusterID] = ch
    b.mu.Unlock()

    err := b.bootstrap(clusterID)

    b.mu.Lock()
    delete(b.inflight, clusterID)
    if err == nil {
        b.done[clusterID] = struct{}{}
    }
    close(ch)  // <-- Unblocks waiters AFTER done is set
    b.mu.Unlock()
}

The blocking mechanism doesn't help because:

  1. Request 1 triggers bootstrap, sets inflight[clusterID] = ch
  2. Request 2 arrives, blocks on <-ch
  3. Request 1's bootstrap times out, returns nil (BUG)
  4. Request 1 sets done[clusterID], then closes ch
  5. Request 2 unblocks, returns (doesn't retry - just exits)
  6. Request 3 arrives, sees done[clusterID], returns without retrying

The blocking ensures that requests 2+ wait for request 1 to complete, but they never retry: done has already been set by the time the channel closes. A stripped-down model of the pattern is sketched below.
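To see why error propagation is what makes the done/inflight machinery useful, here is a simplified model of the same pattern (names shortened; not the real ServiceCIDRBootstrapper) with an injected bootstrap that fails once and then succeeds:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// Stripped-down model of the done/inflight pattern in Ensure.
type bootstrapper struct {
    mu       sync.Mutex
    done     map[string]struct{}
    inflight map[string]chan struct{}
    boot     func(string) error // injected so the demo can simulate failure
}

func (b *bootstrapper) Ensure(id string) {
    b.mu.Lock()
    if _, ok := b.done[id]; ok {
        b.mu.Unlock()
        return
    }
    if ch, ok := b.inflight[id]; ok {
        b.mu.Unlock()
        <-ch
        return
    }
    ch := make(chan struct{})
    b.inflight[id] = ch
    b.mu.Unlock()

    err := b.boot(id)

    b.mu.Lock()
    delete(b.inflight, id)
    if err == nil {
        b.done[id] = struct{}{} // only a successful bootstrap is recorded
    }
    close(ch)
    b.mu.Unlock()
}

func main() {
    attempts := 0
    b := &bootstrapper{
        done:     map[string]struct{}{},
        inflight: map[string]chan struct{}{},
        boot: func(string) error {
            attempts++
            if attempts == 1 {
                return errors.New("timed out") // first attempt fails
            }
            return nil
        },
    }
    b.Ensure("cluster-a") // fails; done stays unset, so a retry is possible
    b.Ensure("cluster-a") // retries and succeeds; done is set
    b.Ensure("cluster-a") // no-op: done is set
    fmt.Println("attempts:", attempts) // attempts: 2
    // With the buggy bootstrap() that always returns nil, attempts would be 1
    // and the cluster would be wedged in done without a ServiceCIDR.
}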

Evidence 5: The integration test doesn't verify retry behavior

From test/smoke/core_bootstrap_test.go:45-76:

func TestServiceCIDRBootstrapPerCluster(t *testing.T) {
    // ...
    deadline := time.Now().Add(10 * time.Second)
    for {
        _, err := cs.NetworkingV1().ServiceCIDRs().Get(t.Context(), "kubernetes", metav1.GetOptions{})
        if err == nil {
            return  // Success
        }
        if time.Now().After(deadline) {
            t.Skipf("ServiceCIDR bootstrap not observed (likely feature-disabled): %v", err)  // <-- SKIPS on timeout
        }
        time.Sleep(250 * time.Millisecond)
    }
}

The test:

  1. Waits up to 10 seconds for ServiceCIDR to appear
  2. Skips (doesn't fail) if timeout occurs
  3. Does NOT test the retry mechanism
  4. Does NOT verify what happens when a second request arrives after bootstrap failure
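A stricter variant would fail rather than skip. A sketch, assuming the suite runs with the feature gate known to be enabled so a missing ServiceCIDR is a genuine failure (wait.PollUntilContextTimeout is from k8s.io/apimachinery/pkg/util/wait):

func TestServiceCIDRBootstrapPerCluster(t *testing.T) {
    // ... cluster/client setup as in the existing test ...
    err := wait.PollUntilContextTimeout(t.Context(), 250*time.Millisecond, 10*time.Second, true,
        func(ctx context.Context) (bool, error) {
            _, err := cs.NetworkingV1().ServiceCIDRs().Get(ctx, "kubernetes", metav1.GetOptions{})
            return err == nil, nil
        })
    if err != nil {
        // Fail instead of skip: with the bug, the cluster is marked done and
        // no number of further requests will ever create the ServiceCIDR.
        t.Fatalf("default ServiceCIDR never appeared: %v", err)
    }
}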

Reproduction scenario

To trigger this bug:

  1. Enable MultiCIDRServiceAllocator feature gate (required for ServiceCIDR functionality)
  2. Set --service-cidr-sharing-mode=per-cluster (required for per-cluster bootstrapping)
  3. First request to a new cluster triggers bootstrap
  4. Etcd/network issues cause ServiceCIDR creation to take > 10 seconds
  5. Bootstrap times out, returns nil, sets done[clusterID]
  6. All subsequent requests to that cluster see done[clusterID] and skip retry
  7. ServiceCIDR never created, service IP allocation broken

Recommended fix

func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
    client, err := b.clientForCluster(clusterID)
    if err != nil {
        return err
    }
    ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    ctrl.Start(ctx)
    if err := ctx.Err(); err != nil {
        return fmt.Errorf("servicecidr bootstrap interrupted: %w", err)
    }
    return nil
}

This fix:

  1. Returns nil when sync succeeds (ctx.Err() is nil - context not cancelled)
  2. Returns error when sync fails due to timeout (ctx.Err() is DeadlineExceeded)
  3. Enables the retry mechanism in Ensure() to work as designed
  4. Follows the same pattern as sibling bootstrappers: propagate errors so retry can occur
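Because the fix wraps the cause with %w, a caller can still distinguish the timeout specifically if it ever needs to. A sketch (the current Ensure() only needs err != nil):

if err := b.bootstrap(clusterID); err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        // Bootstrap timed out; done[clusterID] stays unset, so the next
        // request to this cluster retries the whole bootstrap.
    }
    klog.Errorf("mc.bootstrap servicecidr failed cluster=%s: %v", clusterID, err)
}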

Why this is a bug, not a design choice

  1. The done map implies retry semantics: Clusters are only added to done when bootstrap() returns nil. If bootstrap is meant to be fire-and-forget, the done map wouldn't exist.

  2. The inflight blocking implies retry semantics: If bootstrap failure is acceptable, why block concurrent requests? They could just return immediately.

  3. The error logging at servicecidr.go:80 implies error handling: klog.Errorf("mc.bootstrap servicecidr failed cluster=%s: %v", clusterID, err). This code path is currently unreachable because bootstrap() never returns an error.

  4. The pattern is inconsistent: Sibling bootstrappers propagate errors. Only ServiceCIDR breaks the pattern, and only because Start() returns void.

History

This bug was introduced in commit 258c8bd (@zachsmith1, 2026-02-13, PR #5). The commit added the ServiceCIDRBootstrapper to align with upstream apiserver controllers, but the author overlooked that defaultservicecidr.Controller.Start() is designed for apiserver startup (fire-and-forget with continuous background sync), not for a bootstrapper pattern that needs timeout detection and retry semantics. The return nil without a ctx.Err() check has been present since the file was first created.
