[Detail Bug] Control plane bootstrap: ServiceCIDR creation failures are silently treated as success, preventing retry #20

@detail-app

Description

Summary

  • Context: The ServiceCIDRBootstrapper creates a default ServiceCIDR object in each virtual control plane to enable service IP allocation.
  • Bug: The bootstrap() method always returns nil, ignoring context timeout and making it impossible to detect when ServiceCIDR creation failed.
  • Expected vs. actual: The method should return an error when the context times out before the ServiceCIDR is created, allowing a retry on a subsequent request. Instead, it marks the cluster as "done" regardless of success.
  • Impact: If ServiceCIDR creation takes longer than the 10-second timeout or fails, the cluster will be permanently marked as bootstrapped without a ServiceCIDR, breaking service IP allocation for that cluster.

Code with bug

// servicecidr.go:84-94
func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
    client, err := b.clientForCluster(clusterID)
    if err != nil {
        return err
    }
    ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    ctrl.Start(ctx)
    return nil  // <-- BUG: Always returns nil, even on timeout
}

Evidence

Evidence 1: Upstream controller Start() implementation (exact source)

From k8s.io/kubernetes/pkg/controlplane/controller/defaultservicecidr/default_servicecidr_controller.go:

// Start will not return until the default ServiceCIDR exists or stopCh is closed.
func (c *Controller) Start(ctx context.Context) {
    defer utilruntime.HandleCrash()
    stopCh := ctx.Done()  // <-- Gets channel that closes when context is done

    // ... event broadcaster setup ...

    go c.serviceCIDRInformer.Run(stopCh)
    if !cache.WaitForNamedCacheSync(controllerName, stopCh, c.serviceCIDRsSynced) {
        return  // <-- Returns early if context done before cache syncs
    }

    // wait until first successfully sync
    // this blocks apiserver startup so poll with a short interval
    err := wait.PollUntilContextCancel(ctx, 100*time.Millisecond, true, func(ctx context.Context) (bool, error) {
        syncErr := c.sync()
        return syncErr == nil, nil
    })
    if err != nil {
        klog.Infof("error initializing the default ServiceCIDR: %v", err)
        // <-- CONTINUES EXECUTION, does not return
    }

    // run the sync loop in the background with the defined interval
    go wait.Until(func() {
        err := c.sync()
        if err != nil {
            klog.Infof("error trying to sync the default ServiceCIDR: %v", err)
        }
    }, c.interval, stopCh)  // <-- Uses stopCh (ctx.Done())
}

Source: github.com/kplane-dev/kubernetes at commit 20260404055358-aac72e7d04ad, used via go.mod replace directive at line 53.

Why wait.Until never runs the sync function after timeout

From k8s.io/apimachinery/pkg/util/wait/backoff.go:

func Until(f func(), period time.Duration, stopCh <-chan struct{}) {
    // delegates via JitterUntil; equivalent to:
    BackoffUntil(f, NewJitteredBackoffManager(period, 0.0, &clock.RealClock{}), true, stopCh)
}

func BackoffUntil(f func(), backoff BackoffManager, sliding bool, stopCh <-chan struct{}) {
    var t clock.Timer
    for {
        select {
        case <-stopCh:  // <-- CHECKS STOPCH FIRST
            return      // <-- EXITS IMMEDIATELY if already closed
        default:
        }
        // ... timer logic and f() call happen AFTER this check ...
    }
}

When timeout fires, the background goroutine exits without ever running sync():

  1. wait.PollUntilContextCancel times out after 10 seconds
  2. Start() continues and spawns go wait.Until(func() { c.sync() }, c.interval, stopCh)
  3. stopCh is ctx.Done(), which is already closed (timeout fired)
  4. BackoffUntil enters loop, select { case <-stopCh: return } matches immediately
  5. Goroutine exits without ever calling sync()
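A minimal standalone program makes steps 4-5 concrete: if the stop channel is already closed when wait.Until is called, the function argument is never invoked. (A sketch only; it uses the real k8s.io/apimachinery/pkg/util/wait package, with timings shortened for the demo.)

package main

import (
    "context"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    // Stand-in for the bootstrap context after its 10s timeout has fired.
    ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
    defer cancel()
    <-ctx.Done() // the deadline has passed; ctx.Done() is now closed

    ran := false
    // Same shape as: go wait.Until(func() { c.sync() }, c.interval, stopCh)
    wait.Until(func() { ran = true }, time.Second, ctx.Done())

    // wait.Until returned immediately: BackoffUntil's select saw the closed
    // stopCh before ever scheduling the first call to f.
    fmt.Println("sync ran:", ran) // prints "sync ran: false"
}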

Evidence 2: ctx.Err() correctly distinguishes success from failure

The reviewer asks how we can distinguish success from failure. The key is understanding when ctx.Err() becomes non-nil.

| Scenario | wait.PollUntilContextCancel returns | ctx.Err() after poll returns | Background goroutine behavior |
| --- | --- | --- | --- |
| Sync succeeds before timeout | nil (condition returned true) | nil (context not cancelled) | Runs every 10 seconds |
| Timeout fires before sync succeeds | context.DeadlineExceeded | context.DeadlineExceeded | Exits immediately (see Evidence 1) |

Critical insight: defer cancel() in bootstrap() has NOT run when Start() returns. The context is only cancelled by the timeout, not by the defer. So:

  • ctx.Err() == nil after Start() returns → sync succeeded
  • ctx.Err() != nil after Start() returns → sync failed due to timeout
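A short standalone sketch of the distinction (timings shortened for the demo):

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    // While the context is live (the "sync succeeded before timeout" case):
    fmt.Println(ctx.Err()) // <nil>

    // After the deadline fires (the "timeout before sync succeeded" case):
    time.Sleep(100 * time.Millisecond)
    fmt.Println(ctx.Err()) // context deadline exceeded

    // Note that cancel() has not run yet, mirroring bootstrap(): the deferred
    // cancel executes only after Start() returns and ctx.Err() is inspected.
}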

Evidence 3: Why sibling bootstrappers don't need ctx.Err() check

The reviewer asks why systemnamespaces.go and rbac.go don't check ctx.Err(). The answer: they call functions that return errors.

systemnamespaces.go:81-108:

func (b *SystemNamespaceBootstrapper) bootstrap(clusterID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()

    // (client acquisition elided from this excerpt)
    for _, ns := range b.namespaces {
        _, err := client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{...}, metav1.CreateOptions{})
        if err != nil && !apierrors.IsAlreadyExists(err) {
            return err  // <-- API call returns error on timeout
        }
    }
    return nil
}

The API client returns context.DeadlineExceeded when the context times out. The error is propagated directly.
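This is not special to client-go: any context-aware client surfaces the context error through the call's returned error. A self-contained sketch using net/http, with a deliberately unresponsive test server standing in for a slow apiserver or etcd:

package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

func main() {
    // A handler that never answers, standing in for a slow backend.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        <-r.Context().Done()
    }))
    defer srv.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
    _, err := http.DefaultClient.Do(req)

    // The context deadline surfaces in the returned error, so the caller can
    // propagate it, which is exactly what systemnamespaces.go relies on.
    fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}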

rbac.go:90-116:

func (b *RBACBootstrapper) bootstrap(clusterID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    // hook is the RBAC post-start hook (acquisition elided from this excerpt)
    return hook(genericapiserver.PostStartHookContext{
        Context: ctx,
    })  // <-- hook() returns error on timeout
}

servicecidr.go is unique because ctrl.Start(ctx) returns nothing: there is no error value to propagate. It is the only bootstrapper that calls a fire-and-forget function. The controller's Start() method was designed for apiserver startup, where:

  1. The process continues running indefinitely
  2. Errors are logged, not propagated
  3. There's no retry mechanism needed

The bootstrapper misuses this controller by:

  1. Applying a short timeout (10 seconds)
  2. Expecting error propagation (which Start() doesn't support)
  3. Implementing a retry mechanism (which Start() doesn't participate in)

Evidence 4: The inflight blocking mechanism doesn't compensate for the bug

The reviewer correctly notes that inflight stores chan struct{} values and that concurrent requests block and wait rather than returning early.

servicecidr.go:52-82:

func (b *ServiceCIDRBootstrapper) Ensure(clusterID string) {
    b.mu.Lock()
    if _, ok := b.done[clusterID]; ok {
        b.mu.Unlock()
        return
    }
    if ch, ok := b.inflight[clusterID]; ok {
        b.mu.Unlock()
        <-ch  // <-- BLOCKS until channel closed
        return
    }
    ch := make(chan struct{})
    b.inflight[clusterID] = ch
    b.mu.Unlock()

    err := b.bootstrap(clusterID)

    b.mu.Lock()
    delete(b.inflight, clusterID)
    if err == nil {
        b.done[clusterID] = struct{}{}
    }
    close(ch)  // <-- Unblocks waiters AFTER done is set
    b.mu.Unlock()
}

The blocking mechanism doesn't help because:

  1. Request 1 triggers bootstrap, sets inflight[clusterID] = ch
  2. Request 2 arrives, blocks on <-ch
  3. Request 1's bootstrap times out, returns nil (BUG)
  4. Request 1 sets done[clusterID], then closes ch
  5. Request 2 unblocks, returns (doesn't retry - just exits)
  6. Request 3 arrives, sees done[clusterID], returns without retrying

The blocking ensures that requests 2+ wait for request 1 to complete, but they never retry: done has already been set by the time the channel closes. A stripped-down model of the pattern is sketched below.
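To see why error propagation is what makes the done/inflight machinery useful, here is a simplified model of the same pattern (names shortened; not the real ServiceCIDRBootstrapper) with an injected bootstrap that fails once and then succeeds:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// Stripped-down model of the done/inflight pattern in Ensure.
type bootstrapper struct {
    mu       sync.Mutex
    done     map[string]struct{}
    inflight map[string]chan struct{}
    boot     func(string) error // injected so the demo can simulate failure
}

func (b *bootstrapper) Ensure(id string) {
    b.mu.Lock()
    if _, ok := b.done[id]; ok {
        b.mu.Unlock()
        return
    }
    if ch, ok := b.inflight[id]; ok {
        b.mu.Unlock()
        <-ch
        return
    }
    ch := make(chan struct{})
    b.inflight[id] = ch
    b.mu.Unlock()

    err := b.boot(id)

    b.mu.Lock()
    delete(b.inflight, id)
    if err == nil {
        b.done[id] = struct{}{} // only a successful bootstrap is recorded
    }
    close(ch)
    b.mu.Unlock()
}

func main() {
    attempts := 0
    b := &bootstrapper{
        done:     map[string]struct{}{},
        inflight: map[string]chan struct{}{},
        boot: func(string) error {
            attempts++
            if attempts == 1 {
                return errors.New("timed out") // first attempt fails
            }
            return nil
        },
    }
    b.Ensure("cluster-a") // fails; done stays unset, so a retry is possible
    b.Ensure("cluster-a") // retries and succeeds; done is set
    b.Ensure("cluster-a") // no-op: done is set
    fmt.Println("attempts:", attempts) // attempts: 2
    // With the buggy bootstrap() that always returns nil, attempts would be 1
    // and the cluster would be wedged in done without a ServiceCIDR.
}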

Evidence 5: The integration test doesn't verify retry behavior

From test/smoke/core_bootstrap_test.go:45-76:

func TestServiceCIDRBootstrapPerCluster(t *testing.T) {
    // ...
    deadline := time.Now().Add(10 * time.Second)
    for {
        _, err := cs.NetworkingV1().ServiceCIDRs().Get(t.Context(), "kubernetes", metav1.GetOptions{})
        if err == nil {
            return  // Success
        }
        if time.Now().After(deadline) {
            t.Skipf("ServiceCIDR bootstrap not observed (likely feature-disabled): %v", err)  // <-- SKIPS on timeout
        }
        time.Sleep(250 * time.Millisecond)
    }
}

The test:

  1. Waits up to 10 seconds for ServiceCIDR to appear
  2. Skips (doesn't fail) if timeout occurs
  3. Does NOT test the retry mechanism
  4. Does NOT verify what happens when a second request arrives after bootstrap failure
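A stricter variant would fail rather than skip. A sketch, assuming the suite runs with the feature gate known to be enabled so a missing ServiceCIDR is a genuine failure (wait.PollUntilContextTimeout is from k8s.io/apimachinery/pkg/util/wait):

func TestServiceCIDRBootstrapPerCluster(t *testing.T) {
    // ... cluster/client setup as in the existing test ...
    err := wait.PollUntilContextTimeout(t.Context(), 250*time.Millisecond, 10*time.Second, true,
        func(ctx context.Context) (bool, error) {
            _, err := cs.NetworkingV1().ServiceCIDRs().Get(ctx, "kubernetes", metav1.GetOptions{})
            return err == nil, nil
        })
    if err != nil {
        // Fail instead of skip: with the bug, the cluster is marked done and
        // no number of further requests will ever create the ServiceCIDR.
        t.Fatalf("default ServiceCIDR never appeared: %v", err)
    }
}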

Reproduction scenario

To trigger this bug:

  1. Enable MultiCIDRServiceAllocator feature gate (required for ServiceCIDR functionality)
  2. Set --service-cidr-sharing-mode=per-cluster (required for per-cluster bootstrapping)
  3. First request to a new cluster triggers bootstrap
  4. Etcd/network issues cause ServiceCIDR creation to take > 10 seconds
  5. Bootstrap times out, returns nil, sets done[clusterID]
  6. All subsequent requests to that cluster see done[clusterID] and skip retry
  7. ServiceCIDR never created, service IP allocation broken

Recommended fix

func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
    client, err := b.clientForCluster(clusterID)
    if err != nil {
        return err
    }
    ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
    ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
    defer cancel()
    ctrl.Start(ctx)
    if err := ctx.Err(); err != nil {
        return fmt.Errorf("servicecidr bootstrap interrupted: %w", err)
    }
    return nil
}

This fix:

  1. Returns nil when sync succeeds (ctx.Err() is nil - context not cancelled)
  2. Returns error when sync fails due to timeout (ctx.Err() is DeadlineExceeded)
  3. Enables the retry mechanism in Ensure() to work as designed
  4. Follows the same pattern as sibling bootstrappers: propagate errors so retry can occur
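Because the fix wraps the cause with %w, a caller can still distinguish the timeout specifically if it ever needs to. A sketch (the current Ensure() only needs err != nil):

if err := b.bootstrap(clusterID); err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        // Bootstrap timed out; done[clusterID] stays unset, so the next
        // request to this cluster retries the whole bootstrap.
    }
    klog.Errorf("mc.bootstrap servicecidr failed cluster=%s: %v", clusterID, err)
}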

Why this is a bug, not a design choice

  1. The done map implies retry semantics: Clusters are only added to done when bootstrap() returns nil. If bootstrap is meant to be fire-and-forget, the done map wouldn't exist.

  2. The inflight blocking implies retry semantics: If bootstrap failure is acceptable, why block concurrent requests? They could just return immediately.

  3. The error logging at servicecidr.go:80 implies error handling: klog.Errorf("mc.bootstrap servicecidr failed cluster=%s: %v", clusterID, err). This code path is currently unreachable because bootstrap() never returns an error.

  4. The pattern is inconsistent: Sibling bootstrappers propagate errors. Only ServiceCIDR breaks the pattern, and only because Start() returns void.

History

This bug was introduced in commit 258c8bd (@zachsmith1, 2026-02-13, PR #5). The commit added the ServiceCIDRBootstrapper to align with upstream apiserver controllers, but the author overlooked that defaultservicecidr.Controller.Start() is designed for apiserver startup (fire-and-forget with continuous background sync), not for a bootstrapper pattern that needs timeout detection and retry semantics. The return nil without a ctx.Err() check has been present since the file was first created.
