Summary
- Context: The `ServiceCIDRBootstrapper` creates a default ServiceCIDR object in each virtual control plane to enable service IP allocation.
- Bug: The `bootstrap()` method always returns `nil`, ignoring context timeout and making it impossible to detect when ServiceCIDR creation failed.
- Actual vs. expected: The method should return an error when the context times out before ServiceCIDR is created, allowing retry on subsequent requests. Instead, it marks the cluster as "done" regardless of success.
- Impact: If ServiceCIDR creation takes longer than the 10-second timeout or fails, the cluster will be permanently marked as bootstrapped without a ServiceCIDR, breaking service IP allocation for that cluster.
Code with bug
```go
// servicecidr.go:84-94
func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
	client, err := b.clientForCluster(clusterID)
	if err != nil {
		return err
	}
	ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
	ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
	defer cancel()
	ctrl.Start(ctx)
	return nil // <-- BUG: always returns nil, even on timeout
}
```
Evidence
Evidence 1: Upstream controller Start() implementation (exact source)
From k8s.io/kubernetes/pkg/controlplane/controller/defaultservicecidr/default_servicecidr_controller.go:
```go
// Start will not return until the default ServiceCIDR exists or stopCh is closed.
func (c *Controller) Start(ctx context.Context) {
	defer utilruntime.HandleCrash()
	stopCh := ctx.Done() // <-- channel that closes when the context is done
	// ... event broadcaster setup ...
	go c.serviceCIDRInformer.Run(stopCh)
	if !cache.WaitForNamedCacheSync(controllerName, stopCh, c.serviceCIDRsSynced) {
		return // <-- returns early if the context is done before the cache syncs
	}

	// wait until first successfully sync
	// this blocks apiserver startup so poll with a short interval
	err := wait.PollUntilContextCancel(ctx, 100*time.Millisecond, true, func(ctx context.Context) (bool, error) {
		syncErr := c.sync()
		return syncErr == nil, nil
	})
	if err != nil {
		klog.Infof("error initializing the default ServiceCIDR: %v", err)
		// <-- CONTINUES EXECUTION, does not return
	}

	// run the sync loop in the background with the defined interval
	go wait.Until(func() {
		err := c.sync()
		if err != nil {
			klog.Infof("error trying to sync the default ServiceCIDR: %v", err)
		}
	}, c.interval, stopCh) // <-- uses stopCh (ctx.Done())
}
```
Source: github.com/kplane-dev/kubernetes at commit 20260404055358-aac72e7d04ad, used via go.mod replace directive at line 53.
Why wait.Until never runs the sync function after timeout
From k8s.io/apimachinery/pkg/util/wait/backoff.go:
```go
func Until(f func(), period time.Duration, stopCh <-chan struct{}) {
	BackoffUntil(f, NewJitteredBackoffManager(period, 0.0, Clock), true, stopCh)
}

func BackoffUntil(f func(), backoff BackoffManager, sliding bool, stopCh <-chan struct{}) {
	var t clock.Timer
	for {
		select {
		case <-stopCh: // <-- CHECKS STOPCH FIRST
			return // <-- exits immediately if already closed
		default:
		}
		// ... timer logic and f() call happen AFTER this check ...
	}
}
```
When the timeout fires, the background goroutine exits without ever running `sync()`:
1. `wait.PollUntilContextCancel` times out after 10 seconds.
2. `Start()` continues and spawns `go wait.Until(func() { c.sync() }, 10*time.Second, stopCh)`.
3. `stopCh` is `ctx.Done()`, which is already closed (the timeout fired).
4. `BackoffUntil` enters its loop; `select { case <-stopCh: return }` matches immediately.
5. The goroutine exits without ever calling `sync()`.
Evidence 2: ctx.Err() correctly distinguishes success from failure
The reviewer asks how we can distinguish success from failure. The key is understanding when ctx.Err() becomes non-nil.
| Scenario | `wait.PollUntilContextCancel` returns | `ctx.Err()` after poll returns | Background goroutine behavior |
|---|---|---|---|
| Sync succeeds before timeout | `nil` (condition returned `true`) | `nil` (context not cancelled) | Runs every 10 seconds |
| Timeout fires before sync succeeds | `context.DeadlineExceeded` | `context.DeadlineExceeded` | Exits immediately (see Evidence 1) |
Critical insight: `defer cancel()` in `bootstrap()` has NOT run when `Start()` returns. The context is only cancelled by the timeout, not by the defer. So:
- `ctx.Err() == nil` after `Start()` returns → sync succeeded
- `ctx.Err() != nil` after `Start()` returns → sync failed due to timeout
Evidence 3: Why sibling bootstrappers don't need ctx.Err() check
The reviewer asks why systemnamespaces.go and rbac.go don't check ctx.Err(). The answer: they call functions that return errors.
systemnamespaces.go:81-108:
```go
func (b *SystemNamespaceBootstrapper) bootstrap(clusterID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
	defer cancel()
	for _, ns := range b.namespaces {
		_, err := client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{...}, metav1.CreateOptions{})
		if err != nil && !apierrors.IsAlreadyExists(err) {
			return err // <-- API call returns error on timeout
		}
	}
	return nil
}
```
The API client returns context.DeadlineExceeded when the context times out. The error is propagated directly.
rbac.go:90-116:
```go
func (b *RBACBootstrapper) bootstrap(clusterID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
	defer cancel()
	return hook(genericapiserver.PostStartHookContext{
		Context: ctx,
	}) // <-- hook() returns error on timeout
}
```
`servicecidr.go` is unique because `ctrl.Start(ctx)` returns void (no error value). It's the only bootstrapper that calls a fire-and-forget function. The controller's `Start()` method was designed for apiserver startup, where:
- The process continues running indefinitely
- Errors are logged, not propagated
- No retry mechanism is needed

The bootstrapper misuses this controller by:
- Applying a short timeout (10 seconds)
- Expecting error propagation (which `Start()` doesn't support)
- Implementing a retry mechanism (which `Start()` doesn't participate in)
Evidence 4: The inflight blocking mechanism doesn't compensate for the bug
The reviewer correctly notes that inflight stores chan struct{} and concurrent requests block and wait, not return early.
servicecidr.go:52-82:
```go
func (b *ServiceCIDRBootstrapper) Ensure(clusterID string) {
	b.mu.Lock()
	if _, ok := b.done[clusterID]; ok {
		b.mu.Unlock()
		return
	}
	if ch, ok := b.inflight[clusterID]; ok {
		b.mu.Unlock()
		<-ch // <-- BLOCKS until channel closed
		return
	}
	ch := make(chan struct{})
	b.inflight[clusterID] = ch
	b.mu.Unlock()

	err := b.bootstrap(clusterID)

	b.mu.Lock()
	delete(b.inflight, clusterID)
	if err == nil {
		b.done[clusterID] = struct{}{}
	}
	close(ch) // <-- unblocks waiters AFTER done is set
	b.mu.Unlock()
}
```
The blocking mechanism doesn't help because:
1. Request 1 triggers bootstrap and sets `inflight[clusterID] = ch`.
2. Request 2 arrives and blocks on `<-ch`.
3. Request 1's bootstrap times out but returns `nil` (BUG).
4. Request 1 sets `done[clusterID]`, then closes `ch`.
5. Request 2 unblocks and returns (it doesn't retry, it just exits).
6. Request 3 arrives, sees `done[clusterID]`, and returns without retrying.

The blocking ensures requests 2+ wait for request 1 to complete, but they still don't retry, because by the time the channel is closed, `done` is already set.
Evidence 5: The integration test doesn't verify retry behavior
From test/smoke/core_bootstrap_test.go:45-76:
```go
func TestServiceCIDRBootstrapPerCluster(t *testing.T) {
	// ...
	deadline := time.Now().Add(10 * time.Second)
	for {
		_, err := cs.NetworkingV1().ServiceCIDRs().Get(t.Context(), "kubernetes", metav1.GetOptions{})
		if err == nil {
			return // Success
		}
		if time.Now().After(deadline) {
			t.Skipf("ServiceCIDR bootstrap not observed (likely feature-disabled): %v", err) // <-- SKIPS on timeout
		}
		time.Sleep(250 * time.Millisecond)
	}
}
```
The test:
- Waits up to 10 seconds for ServiceCIDR to appear
- Skips (doesn't fail) if timeout occurs
- Does NOT test the retry mechanism
- Does NOT verify what happens when a second request arrives after bootstrap failure
Reproduction scenario
To trigger this bug:
1. Enable the `MultiCIDRServiceAllocator` feature gate (required for ServiceCIDR functionality).
2. Set `--service-cidr-sharing-mode=per-cluster` (required for per-cluster bootstrapping).
3. The first request to a new cluster triggers bootstrap.
4. Etcd/network issues cause ServiceCIDR creation to take more than 10 seconds.
5. Bootstrap times out, returns `nil`, and sets `done[clusterID]`.
6. All subsequent requests to that cluster see `done[clusterID]` and skip the retry.
7. The ServiceCIDR is never created; service IP allocation for that cluster is broken.
Recommended fix
```go
func (b *ServiceCIDRBootstrapper) bootstrap(clusterID string) error {
	client, err := b.clientForCluster(clusterID)
	if err != nil {
		return err
	}
	ctrl := defaultservicecidr.NewController(b.primaryRange, b.secondaryRange, client)
	ctx, cancel := context.WithTimeout(context.Background(), b.timeout)
	defer cancel()
	ctrl.Start(ctx)
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("servicecidr bootstrap interrupted: %w", err)
	}
	return nil
}
```
This fix:
- Returns `nil` when sync succeeds (`ctx.Err()` is nil: the context was not cancelled)
- Returns an error when sync fails due to timeout (`ctx.Err()` is `DeadlineExceeded`)
- Enables the retry mechanism in `Ensure()` to work as designed
- Follows the same pattern as the sibling bootstrappers: propagate errors so retry can occur
Why this is a bug, not a design choice
- The `done` map implies retry semantics: clusters are only added to `done` when `bootstrap()` returns `nil`. If bootstrap were meant to be fire-and-forget, the `done` map wouldn't exist.
- The `inflight` blocking implies retry semantics: if bootstrap failure were acceptable, why block concurrent requests? They could just return immediately.
- The error logging at line 80 implies error handling: `klog.Errorf("mc.bootstrap servicecidr failed cluster=%s: %v", clusterID, err)`. This code path is unreachable because `bootstrap()` never returns an error.
- The pattern is inconsistent: sibling bootstrappers propagate errors. Only ServiceCIDR breaks the pattern, and only because `Start()` returns void.
History
This bug was introduced in commit 258c8bd (@zachsmith1, 2026-02-13, PR #5). The commit added the ServiceCIDRBootstrapper to align with upstream apiserver controllers, but the author overlooked that defaultservicecidr.Controller.Start() is designed for apiserver startup (fire-and-forget with continuous background sync) rather than a bootstrapper pattern that needs timeout detection and retry semantics. The return nil without checking ctx.Err() has been present since the file was first created.