Summary
- Context: InformerRegistry creates MultiClusterInformer instances on-demand, passing its context to control the informer's lifetime.
- Bug: Get() returns "success" with an informer that will never sync when the registry's context is already cancelled.
- Actual vs. expected: Callers expect Get() to either return a working informer or an error; instead they get an informer that appears valid but never syncs.
- Impact: Callers that wait for HasSynced() without context checks spin forever. At least one such caller exists: webhook/manager.go:139-154.
Code with Bug
pkg/multicluster/informer_registry.go:45-75:
func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
    // ... checks for cached informer ...
    mci := informer.New(informer.Config{...})
    go mci.Run(r.ctx) // <-- BUG: creates an informer that stops immediately if r.ctx is cancelled
    r.informers[gr] = mci
    return mci, nil // <-- returns "success" for a broken informer
}
Reproducible Test Case
func TestInformerRegistry_GetWithCancelledContext(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // Pre-cancel: simulates DrainedNotify() firing

    registry := NewInformerRegistry(ctx)
    registry.RegisterStorage(schema.GroupResource{Group: "test", Resource: "things"}, mockStorage)

    mci, err := registry.Get(schema.GroupResource{Group: "test", Resource: "things"})
    if err != nil {
        t.Fatalf("unexpected error: %v", err) // Currently: no error returned
    }

    // Give the goroutine time to start and exit
    time.Sleep(50 * time.Millisecond)

    // BUG: HasSynced is false forever because Run() exited immediately
    if mci.HasSynced() {
        t.Log("informer synced")
    } else {
        t.Error("informer never syncs - Get() should have returned error instead")
    }

    // Demonstrate that the wait loop hangs
    done := make(chan bool)
    go func() {
        for !mci.HasSynced() {
            time.Sleep(10 * time.Millisecond)
        }
        done <- true
    }()
    select {
    case <-done:
        t.Log("wait completed")
    case <-time.After(500 * time.Millisecond):
        t.Error("wait hangs - informer will never sync")
    }
}
Run with: go test -v -run TestInformerRegistry_GetWithCancelledContext ./pkg/multicluster/
Verified Affected Call Path
Path 1: webhook.Manager.envForCluster (NO context check)
pkg/multicluster/admission/webhook/manager.go:139-154:
// Close synced channel when all MCIs have completed initial sync.
mcis := []*mcinformer.MultiClusterInformer{nsMCI, servicesMCI, endpointSlicesMCI, mutatingMCI, validatingMCI}
go func() {
    for {
        allSynced := true
        for _, mci := range mcis {
            if !mci.HasSynced() {
                allSynced = false
                break
            }
        }
        if allSynced {
            close(synced)
            return
        }
        time.Sleep(50 * time.Millisecond)
        // NO context check - goroutine never terminates if HasSynced() is always false
    }
}()
If the registry's context is cancelled when envForCluster is called:
- All 5 Get() calls return informers that never sync
- The goroutine spins forever (20 checks/second)
- The synced channel is never closed
- The goroutine leaks
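For contrast, a context-aware version of this goroutine is a small change. A minimal sketch, not the author's code, assuming the manager carries a ctx the way auth.Manager does:

go func() {
    for {
        allSynced := true
        for _, mci := range mcis {
            if !mci.HasSynced() {
                allSynced = false
                break
            }
        }
        if allSynced {
            close(synced)
            return
        }
        select {
        case <-ctx.Done():
            return // shutdown: stop polling instead of spinning forever
        case <-time.After(50 * time.Millisecond):
        }
    }
}()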
Path 2: crd_runtime_manager.ensureSharedCRDState (context.Background when stopCh is nil)
pkg/multicluster/bootstrap/crd_runtime_manager_wrapped.go:368-379:
ctx := context.Background() // <-- Never cancels when stopCh is nil
if stopCh != nil {
    ctx = wait.ContextForChannel(stopCh)
}
for !mci.HasSynced() {
    select {
    case <-ctx.Done(): // <-- Never fires when stopCh is nil
        return nil, ctx.Err()
    default:
        time.Sleep(10 * time.Millisecond)
    }
}
Callers CRDGetterForRequest (line 488) and CRDListerForRequest (line 499) pass nil stopCh.
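Until those callers pass a real stop channel, the wait could at least be bounded. A minimal sketch; the 30-second timeout is illustrative, not from the source:

ctx := context.Background()
if stopCh != nil {
    ctx = wait.ContextForChannel(stopCh)
} else {
    // Illustrative fallback: bound the wait when no stop channel is supplied,
    // so an informer that will never sync cannot hang the caller indefinitely.
    var cancel context.CancelFunc
    ctx, cancel = context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
}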
Contrast: auth.Manager Has Context Check
pkg/multicluster/auth/manager.go:231-238:
for !mci.HasSynced() {
    select {
    case <-m.ctx.Done(): // <-- Correct: checks context
        m.rbacErr = m.ctx.Err()
        return
    default:
        time.Sleep(10 * time.Millisecond)
    }
}
This shows the authors understood the pattern. The missing check in webhook/manager.go is an inconsistency, not a design choice.
External Dependency Verification
go.mod:
github.com/kplane-dev/informer v0.0.0-20260404055613-d9279cfd5e9b
Verified source at /root/go/pkg/mod/github.com/kplane-dev/informer@v0.0.0-20260404055613-d9279cfd5e9b/informer.go:
func (m *MultiClusterInformer) Run(ctx context.Context) {
    // ... setup ...
    for {
        if ctx.Err() != nil { // Line 102: immediate return if context cancelled
            return
        }
        m.listWatch(ctx) // Line 105: sets hasSynced = true (line 140)
    }
}
When ctx.Err() != nil on entry, Run() returns immediately without calling listWatch(). HasSynced() returns false forever.
Why This Is A Bug
1. Contract Violation
Get() should return:
- A working informer, OR
- An error

When the registry's context is cancelled, Get() returns an informer that:
- Is cached (line 74)
- Will never sync
- Has HasSynced() == false forever
The caller has no way to distinguish "informer is still syncing" from "informer will never sync".
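One way to make that distinction visible at call sites is a bounded wait helper. A hypothetical sketch; waitForSync is not part of the informer API:

// waitForSync polls HasSynced until it succeeds, the context ends, or the
// timeout elapses, turning "will never sync" into an observable error.
func waitForSync(ctx context.Context, mci *informer.MultiClusterInformer, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()
    for !mci.HasSynced() {
        select {
        case <-ctx.Done():
            return fmt.Errorf("informer did not sync: %w", ctx.Err())
        case <-time.After(10 * time.Millisecond):
        }
    }
    return nil
}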
2. Resource Leak
The goroutine in webhook/manager.go:139-154:
- Spins at 20 iterations/second
- Never terminates (no context check)
- Never closes the synced channel
Even if "the process is terminating anyway," this is a resource leak by Go's standards: go vet's lostcancel analyzer flags comparable context-lifetime mistakes, and goroutine leak detectors fail tests over exactly this pattern.
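A leak detector makes the claim concrete. A minimal test sketch using go.uber.org/goleak; newTestManager is a hypothetical helper, not existing code:

func TestEnvForCluster_NoGoroutineLeak(t *testing.T) {
    defer goleak.VerifyNone(t) // fails the test if any goroutine is still running at the end

    ctx, cancel := context.WithCancel(context.Background())
    cancel() // pre-cancel, as in the reproducer above

    mgr := newTestManager(t, ctx) // hypothetical: wires a registry with the cancelled ctx
    _, _ = mgr.envForCluster("cluster-a")
    // With the current code, the HasSynced polling goroutine never exits,
    // so goleak.VerifyNone reports a leak here.
}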
3. Inconsistent Defensive Patterns
The codebase has 3 "wait for HasSynced" loops:
| Location | Has Context Check? |
| --- | --- |
| auth/manager.go:231 | Yes |
| crd_runtime_manager_wrapped.go:372 | Yes (but uses context.Background when stopCh=nil) |
| webhook/manager.go:143 | No |
The inconsistency indicates an oversight, not intentional design.
Recommended Fix
func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
    // Check context before any work
    select {
    case <-r.ctx.Done():
        return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
    default:
    }

    r.mu.Lock()
    if mci, ok := r.informers[gr]; ok {
        r.mu.Unlock()
        return mci, nil
    }
    cs, ok := r.storages[gr]
    r.mu.Unlock()
    if !ok {
        return nil, fmt.Errorf("no registered storage for %s", gr)
    }

    store, err := cs.ensureStore()
    if err != nil {
        return nil, fmt.Errorf("failed to ensure store for %s: %w", gr, err)
    }

    r.mu.Lock()
    defer r.mu.Unlock()
    if mci, ok := r.informers[gr]; ok {
        return mci, nil
    }
    // Double-check after the potentially slow ensureStore call
    select {
    case <-r.ctx.Done():
        return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
    default:
    }

    mci := informer.New(informer.Config{
        Storage:        store,
        ResourcePrefix: cs.kindRootPrefix(),
        GroupResource:  gr,
    })
    go mci.Run(r.ctx)
    r.informers[gr] = mci
    return mci, nil
}
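With the guard in place, callers can fail fast instead of polling a dead informer. An illustrative caller-side sketch, not code from the repository:

mci, err := registry.Get(schema.GroupResource{Group: "", Resource: "namespaces"})
if err != nil {
    // Registry is shutting down: surface the error instead of waiting on HasSynced forever.
    return nil, fmt.Errorf("webhook env unavailable: %w", err)
}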
Response to Reviewer Concerns
| Concern | Response |
| --- | --- |
| "Wait has 10s timeout" | Timeout prevents request hang, but the goroutine still leaks |
| "Process terminates anyway" | Poor engineering practice; Go explicitly rejects this reasoning |
| "Narrow race window" | True, but a narrow window ≠ no bug; the invariant is violated |
| "Fix callers instead" | Registry fix is centralized; a caller fix would require auditing all 3 locations |
| "TOCTOU in fix" | Race window shrinks from "forever" to "nanoseconds"; cached informers bypass the check |
| "HasSynced=false is correct" | Correct state, but Get() should return an error, not success with a broken object |
History
This bug was introduced in commit 12c4811 (@zachsmith1, 2026-03-03, PR #14). The InformerRegistry type was created as a new abstraction in a major refactor that replaced storage fallbacks with fork-native identity propagation. The Get() method was written to create MultiClusterInformer instances on-demand and pass the registry's context to control their lifetime, but no guard was added to check whether that context was already cancelled before creating a new informer. The result is the contract violation described above: Get() returns success for an informer that will never sync.