[Detail Bug] Multicluster: InformerRegistry.Get returns non-functional informer after context cancellation, causing indefinite HasSynced waits #26

@detail-app

Description

Summary

  • Context: InformerRegistry creates MultiClusterInformer instances on-demand, passing its context to control the informer's lifetime.
  • Bug: Get() returns "success" with an informer that will never sync when the registry's context is already cancelled.
  • Expected vs. actual: Callers expect Get() to either return a working informer or an error; instead they get an informer that appears valid but never syncs.
  • Impact: Callers that wait for HasSynced() without context checks spin forever. At least one such caller exists: webhook/manager.go:139-154.

Code with Bug

pkg/multicluster/informer_registry.go:45-75:

func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
	// ... checks for cached informer ...
	
	mci := informer.New(informer.Config{...})
	go mci.Run(r.ctx) // <-- BUG: Creates informer that immediately stops if r.ctx cancelled

	r.informers[gr] = mci
	return mci, nil   // <-- Returns "success" for broken informer
}

Reproducible Test Case

func TestInformerRegistry_GetWithCancelledContext(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // Pre-cancel: simulates DrainedNotify() firing

	registry := NewInformerRegistry(ctx)
	registry.RegisterStorage(schema.GroupResource{Group: "test", Resource: "things"}, mockStorage)

	mci, err := registry.Get(schema.GroupResource{Group: "test", Resource: "things"})
	if err != nil {
		t.Fatalf("unexpected error: %v", err) // Currently: no error returned
	}

	// Give the goroutine time to start and exit
	time.Sleep(50 * time.Millisecond)

	// BUG: HasSynced is false forever because Run() exited immediately
	if mci.HasSynced() {
		t.Log("informer synced")
	} else {
		t.Error("informer never syncs - Get() should have returned error instead")
	}

	// Demonstrate the wait loop hangs
	done := make(chan bool, 1) // buffered so the send cannot block if the timeout path wins
	go func() {
		for !mci.HasSynced() {
			time.Sleep(10 * time.Millisecond)
		}
		done <- true
	}()

	select {
	case <-done:
		t.Log("wait completed")
	case <-time.After(500 * time.Millisecond):
		t.Error("wait hangs - informer will never sync")
	}
}

Run with: go test -v -run TestInformerRegistry_GetWithCancelledContext ./pkg/multicluster/

Verified Affected Call Path

Path 1: webhook.Manager.envForCluster (NO context check)

pkg/multicluster/admission/webhook/manager.go:139-154:

// Close synced channel when all MCIs have completed initial sync.
mcis := []*mcinformer.MultiClusterInformer{nsMCI, servicesMCI, endpointSlicesMCI, mutatingMCI, validatingMCI}
go func() {
	for {
		allSynced := true
		for _, mci := range mcis {
			if !mci.HasSynced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			close(synced)
			return
		}
		time.Sleep(50 * time.Millisecond)
		// NO context check - goroutine never terminates if HasSynced() is always false
	}
}()

If the registry's context is cancelled when envForCluster is called:

  1. All 5 Get() calls return informers that never sync
  2. The goroutine spins forever (20 checks/second)
  3. synced channel is never closed
  4. Goroutine leak
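
A minimal defensive sketch for this loop, assuming the manager has access to a cancellable context (the ctx name here is illustrative, not an existing field):

go func() {
	for {
		allSynced := true
		for _, mci := range mcis {
			if !mci.HasSynced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			close(synced)
			return
		}
		select {
		case <-ctx.Done():
			// Shutdown: these informers will never sync. Stop polling
			// instead of leaking the goroutine; synced stays open so
			// callers can distinguish failure from completion.
			return
		case <-time.After(50 * time.Millisecond):
		}
	}
}()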

Path 2: crd_runtime_manager.ensureSharedCRDState (context.Background when stopCh is nil)

pkg/multicluster/bootstrap/crd_runtime_manager_wrapped.go:368-379:

ctx := context.Background()  // <-- Never cancels when stopCh is nil
if stopCh != nil {
	ctx = wait.ContextForChannel(stopCh)
}
for !mci.HasSynced() {
	select {
	case <-ctx.Done():        // <-- Never fires when stopCh is nil
		return nil, ctx.Err()
	default:
		time.Sleep(10 * time.Millisecond)
	}
}

Callers CRDGetterForRequest (line 488) and CRDListerForRequest (line 499) pass nil stopCh.
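
A hedged sketch of a bounded fallback: when stopCh is nil, derive a context with a deadline instead of context.Background(), so the loop cannot spin forever (the 30-second bound is an illustrative assumption, not a value taken from the codebase):

// Assumption: a generous upper bound is preferable to an unbounded wait.
ctx := context.Background()
var cancel context.CancelFunc = func() {}
if stopCh != nil {
	ctx = wait.ContextForChannel(stopCh)
} else {
	ctx, cancel = context.WithTimeout(context.Background(), 30*time.Second)
}
defer cancel()
for !mci.HasSynced() {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	default:
		time.Sleep(10 * time.Millisecond)
	}
}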

Contrast: auth.Manager Has Context Check

pkg/multicluster/auth/manager.go:231-238:

for !mci.HasSynced() {
	select {
	case <-m.ctx.Done():      // <-- Correct: checks context
		m.rbacErr = m.ctx.Err()
		return
	default:
		time.Sleep(10 * time.Millisecond)
	}
}

This shows the authors understood the pattern. The missing check in webhook/manager.go is an inconsistency, not a design choice.

External Dependency Verification

go.mod:

github.com/kplane-dev/informer v0.0.0-20260404055613-d9279cfd5e9b

Verified source at /root/go/pkg/mod/github.com/kplane-dev/informer@v0.0.0-20260404055613-d9279cfd5e9b/informer.go:

func (m *MultiClusterInformer) Run(ctx context.Context) {
	// ... setup ...
	for {
		if ctx.Err() != nil {   // Line 102: immediate return if context cancelled
			return
		}
		m.listWatch(ctx)        // Line 105: sets hasSynced = true (line 140)
	}
}

When ctx.Err() != nil on entry, Run() returns immediately without calling listWatch(). HasSynced() returns false forever.

Why This Is A Bug

1. Contract Violation

Get() should return:

  • A working informer, OR
  • An error

When the registry's context is cancelled, Get() returns an informer that:

  • Is cached (line 74)
  • Will never sync
  • Has HasSynced() == false forever

The caller has no way to distinguish "informer is still syncing" from "informer will never sync".
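
One way to make that distinction explicit is a small context-aware wait helper that all three call sites could share (a sketch; waitForSync is a hypothetical name, not an existing function in the codebase):

// waitForSync polls HasSynced until it reports true or ctx is cancelled,
// converting "will never sync" into an observable error.
func waitForSync(ctx context.Context, mci *informer.MultiClusterInformer) error {
	for !mci.HasSynced() {
		select {
		case <-ctx.Done():
			return fmt.Errorf("informer did not sync: %w", ctx.Err())
		case <-time.After(10 * time.Millisecond):
		}
	}
	return nil
}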

2. Resource Leak

The goroutine in webhook/manager.go:139-154:

  • Spins at 20 iterations/second
  • Never terminates (no context check)
  • Never closes the synced channel

Even if "the process is terminating anyway," this is a resource leak by Go's standards. Go's own tooling treats closely related patterns as bugs; for example, go vet's lostcancel analyzer flags abandoned context cancellation.
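
This class of leak is also mechanically detectable. A self-contained sketch using go.uber.org/goleak (assuming it is acceptable as a test-only dependency) that reproduces the webhook polling pattern and fails:

func TestWaitLoopLeaksGoroutine(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if any goroutine outlives it

	synced := make(chan struct{})
	hasSynced := func() bool { return false } // stand-in for HasSynced on a dead informer
	go func() {
		for {
			if hasSynced() {
				close(synced)
				return
			}
			time.Sleep(50 * time.Millisecond)
			// No context check, mirroring webhook/manager.go:139-154.
		}
	}()
	// goleak.VerifyNone reports the polling goroutine above as leaked.
}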

3. Inconsistent Defensive Patterns

The codebase has 3 "wait for HasSynced" loops:

  • auth/manager.go:231: yes, checks context
  • crd_runtime_manager_wrapped.go:372: yes, but falls back to context.Background when stopCh is nil
  • webhook/manager.go:143: no context check

The inconsistency indicates an oversight, not intentional design.

Recommended Fix

func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
	// Check context before any work
	select {
	case <-r.ctx.Done():
		return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
	default:
	}

	r.mu.Lock()
	if mci, ok := r.informers[gr]; ok {
		r.mu.Unlock()
		return mci, nil
	}
	cs, ok := r.storages[gr]
	r.mu.Unlock()

	if !ok {
		return nil, fmt.Errorf("no registered storage for %s", gr)
	}

	store, err := cs.ensureStore()
	if err != nil {
		return nil, fmt.Errorf("failed to ensure store for %s: %w", gr, err)
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	if mci, ok := r.informers[gr]; ok {
		return mci, nil
	}

	// Double-check after potential slow operation
	select {
	case <-r.ctx.Done():
		return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
	default:
	}

	mci := informer.New(informer.Config{
		Storage:        store,
		ResourcePrefix: cs.kindRootPrefix(),
		GroupResource:  gr,
	})

	go mci.Run(r.ctx)

	r.informers[gr] = mci
	return mci, nil
}
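
With the guard in place, call sites can fail fast instead of polling a dead informer (a call-site handling sketch; the wrapping message is illustrative):

mci, err := registry.Get(gr)
if err != nil {
	// Registry is shutting down or has no storage for gr: surface the
	// error rather than waiting on an informer that will never sync.
	return nil, fmt.Errorf("webhook env for %s unavailable: %w", gr, err)
}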

Response to Reviewer Concerns

  • "Wait has 10s timeout": the timeout prevents a request hang, but the goroutine still leaks
  • "Process terminates anyway": poor engineering practice; Go explicitly rejects this reasoning
  • "Narrow race window": true, but a narrow window is not the same as no bug; the invariant is still violated
  • "Fix callers instead": the registry fix is centralized; fixing callers would require auditing all 3 locations
  • "TOCTOU in fix": the race window shrinks from "forever" to nanoseconds; cached informers bypass the check
  • "HasSynced=false is correct": correct state, but Get() should return an error, not success with a broken object

History

This bug was introduced in commit 12c4811 (@zachsmith1, 2026-03-03, PR #14). The InformerRegistry type was created as a new abstraction during a major refactor that replaced storage fallbacks with fork-native identity propagation. Get() was written to create MultiClusterInformer instances on-demand, passing the registry's context to control their lifetime, but no guard was added to check whether that context was already cancelled before creating a new informer. That omission produces the contract violation described above: Get() returns success for an informer that will never sync.
