[Detail Bug] Multicluster: InformerRegistry.Get returns non-functional informer after context cancellation, causing indefinite HasSynced waits #26

@detail-app

Description

Summary

  • Context: InformerRegistry creates MultiClusterInformer instances on-demand, passing its context to control the informer's lifetime.
  • Bug: Get() returns "success" with an informer that will never sync when the registry's context is already cancelled.
  • Expected vs. actual: Callers expect Get() to either return a working informer or an error; instead they get an informer that appears valid but never syncs.
  • Impact: Callers that wait for HasSynced() without context checks spin forever. At least one such caller exists: webhook/manager.go:139-154.

Code with Bug

pkg/multicluster/informer_registry.go:45-75:

func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
	// ... checks for cached informer ...
	
	mci := informer.New(informer.Config{...})
	go mci.Run(r.ctx) // <-- BUG: Creates informer that immediately stops if r.ctx cancelled

	r.informers[gr] = mci
	return mci, nil   // <-- Returns "success" for broken informer
}

Reproducible Test Case

func TestInformerRegistry_GetWithCancelledContext(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // Pre-cancel: simulates DrainedNotify() firing

	registry := NewInformerRegistry(ctx)
	registry.RegisterStorage(schema.GroupResource{Group: "test", Resource: "things"}, mockStorage)

	mci, err := registry.Get(schema.GroupResource{Group: "test", Resource: "things"})
	if err != nil {
		t.Fatalf("unexpected error: %v", err) // Currently: no error returned
	}

	// Give the goroutine time to start and exit
	time.Sleep(50 * time.Millisecond)

	// BUG: HasSynced is false forever because Run() exited immediately
	if mci.HasSynced() {
		t.Log("informer synced")
	} else {
		t.Error("informer never syncs - Get() should have returned error instead")
	}

	// Demonstrate the wait loop hangs
	done := make(chan bool, 1) // buffered so the send cannot block if the timeout path wins
	go func() {
		for !mci.HasSynced() {
			time.Sleep(10 * time.Millisecond)
		}
		done <- true
	}()

	select {
	case <-done:
		t.Log("wait completed")
	case <-time.After(500 * time.Millisecond):
		t.Error("wait hangs - informer will never sync")
	}
}

Run with: go test -v -run TestInformerRegistry_GetWithCancelledContext ./pkg/multicluster/

Verified Affected Call Path

Path 1: webhook.Manager.envForCluster (NO context check)

pkg/multicluster/admission/webhook/manager.go:139-154:

// Close synced channel when all MCIs have completed initial sync.
mcis := []*mcinformer.MultiClusterInformer{nsMCI, servicesMCI, endpointSlicesMCI, mutatingMCI, validatingMCI}
go func() {
	for {
		allSynced := true
		for _, mci := range mcis {
			if !mci.HasSynced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			close(synced)
			return
		}
		time.Sleep(50 * time.Millisecond)
		// NO context check - goroutine never terminates if HasSynced() is always false
	}
}()

If the registry's context is cancelled when envForCluster is called:

  1. All 5 Get() calls return informers that never sync
  2. The goroutine spins forever (20 checks/second)
  3. synced channel is never closed
  4. Goroutine leak
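
A minimal defensive sketch for this loop, assuming the manager has access to a cancellable context (the ctx name here is illustrative, not an existing field):

go func() {
	for {
		allSynced := true
		for _, mci := range mcis {
			if !mci.HasSynced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			close(synced)
			return
		}
		select {
		case <-ctx.Done():
			// Shutdown: these informers will never sync. Stop polling
			// instead of leaking the goroutine; synced stays open so
			// callers can distinguish failure from completion.
			return
		case <-time.After(50 * time.Millisecond):
		}
	}
}()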

Path 2: crd_runtime_manager.ensureSharedCRDState (context.Background when stopCh is nil)

pkg/multicluster/bootstrap/crd_runtime_manager_wrapped.go:368-379:

ctx := context.Background()  // <-- Never cancels when stopCh is nil
if stopCh != nil {
	ctx = wait.ContextForChannel(stopCh)
}
for !mci.HasSynced() {
	select {
	case <-ctx.Done():        // <-- Never fires when stopCh is nil
		return nil, ctx.Err()
	default:
		time.Sleep(10 * time.Millisecond)
	}
}

Callers CRDGetterForRequest (line 488) and CRDListerForRequest (line 499) pass nil stopCh.
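
A hedged sketch of a bounded fallback: when stopCh is nil, derive a context with a deadline instead of context.Background(), so the loop cannot spin forever (the 30-second bound is an illustrative assumption, not a value taken from the codebase):

// Assumption: a generous upper bound is preferable to an unbounded wait.
ctx := context.Background()
var cancel context.CancelFunc = func() {}
if stopCh != nil {
	ctx = wait.ContextForChannel(stopCh)
} else {
	ctx, cancel = context.WithTimeout(context.Background(), 30*time.Second)
}
defer cancel()
for !mci.HasSynced() {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	default:
		time.Sleep(10 * time.Millisecond)
	}
}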

Contrast: auth.Manager Has Context Check

pkg/multicluster/auth/manager.go:231-238:

for !mci.HasSynced() {
	select {
	case <-m.ctx.Done():      // <-- Correct: checks context
		m.rbacErr = m.ctx.Err()
		return
	default:
		time.Sleep(10 * time.Millisecond)
	}
}

This shows the authors understood the pattern. The missing check in webhook/manager.go is an inconsistency, not a design choice.

External Dependency Verification

go.mod:

github.com/kplane-dev/informer v0.0.0-20260404055613-d9279cfd5e9b

Verified source at /root/go/pkg/mod/github.com/kplane-dev/informer@v0.0.0-20260404055613-d9279cfd5e9b/informer.go:

func (m *MultiClusterInformer) Run(ctx context.Context) {
	// ... setup ...
	for {
		if ctx.Err() != nil {   // Line 102: immediate return if context cancelled
			return
		}
		m.listWatch(ctx)        // Line 105: sets hasSynced = true (line 140)
	}
}

When ctx.Err() != nil on entry, Run() returns immediately without calling listWatch(). HasSynced() returns false forever.

Why This Is A Bug

1. Contract Violation

Get() should return:

  • A working informer, OR
  • An error

When the registry's context is cancelled, Get() returns an informer that:

  • Is cached (line 74)
  • Will never sync
  • Has HasSynced() == false forever

The caller has no way to distinguish "informer is still syncing" from "informer will never sync".
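
One way to make that distinction explicit is a small context-aware wait helper that all three call sites could share (a sketch; waitForSync is a hypothetical name, not an existing function in the codebase):

// waitForSync polls HasSynced until it reports true or ctx is cancelled,
// converting "will never sync" into an observable error.
func waitForSync(ctx context.Context, mci *informer.MultiClusterInformer) error {
	for !mci.HasSynced() {
		select {
		case <-ctx.Done():
			return fmt.Errorf("informer did not sync: %w", ctx.Err())
		case <-time.After(10 * time.Millisecond):
		}
	}
	return nil
}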

2. Resource Leak

The goroutine in webhook/manager.go:139-154:

  • Spins at 20 iterations/second
  • Never terminates (no context check)
  • Never closes the synced channel

Even if "the process is terminating anyway," this is a resource leak by Go's standards. Go's own tooling treats closely related patterns as bugs; for example, go vet's lostcancel analyzer flags abandoned context cancellation.
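
This class of leak is also mechanically detectable. A self-contained sketch using go.uber.org/goleak (assuming it is acceptable as a test-only dependency) that reproduces the webhook polling pattern and fails:

func TestWaitLoopLeaksGoroutine(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if any goroutine outlives it

	synced := make(chan struct{})
	hasSynced := func() bool { return false } // stand-in for HasSynced on a dead informer
	go func() {
		for {
			if hasSynced() {
				close(synced)
				return
			}
			time.Sleep(50 * time.Millisecond)
			// No context check, mirroring webhook/manager.go:139-154.
		}
	}()
	// goleak.VerifyNone reports the polling goroutine above as leaked.
}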

3. Inconsistent Defensive Patterns

The codebase has 3 "wait for HasSynced" loops:

  • auth/manager.go:231: yes, checks context
  • crd_runtime_manager_wrapped.go:372: yes, but falls back to context.Background when stopCh is nil
  • webhook/manager.go:143: no context check

The inconsistency indicates an oversight, not intentional design.

Recommended Fix

func (r *InformerRegistry) Get(gr schema.GroupResource) (*informer.MultiClusterInformer, error) {
	// Check context before any work
	select {
	case <-r.ctx.Done():
		return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
	default:
	}

	r.mu.Lock()
	if mci, ok := r.informers[gr]; ok {
		r.mu.Unlock()
		return mci, nil
	}
	cs, ok := r.storages[gr]
	r.mu.Unlock()

	if !ok {
		return nil, fmt.Errorf("no registered storage for %s", gr)
	}

	store, err := cs.ensureStore()
	if err != nil {
		return nil, fmt.Errorf("failed to ensure store for %s: %w", gr, err)
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	if mci, ok := r.informers[gr]; ok {
		return mci, nil
	}

	// Double-check after potential slow operation
	select {
	case <-r.ctx.Done():
		return nil, fmt.Errorf("informer registry shutdown for %s: %w", gr, r.ctx.Err())
	default:
	}

	mci := informer.New(informer.Config{
		Storage:        store,
		ResourcePrefix: cs.kindRootPrefix(),
		GroupResource:  gr,
	})

	go mci.Run(r.ctx)

	r.informers[gr] = mci
	return mci, nil
}
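
With the guard in place, call sites can fail fast instead of polling a dead informer (a call-site handling sketch; the wrapping message is illustrative):

mci, err := registry.Get(gr)
if err != nil {
	// Registry is shutting down or has no storage for gr: surface the
	// error rather than waiting on an informer that will never sync.
	return nil, fmt.Errorf("webhook env for %s unavailable: %w", gr, err)
}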

Response to Reviewer Concerns

  • "Wait has 10s timeout": the timeout prevents a request hang, but the goroutine still leaks
  • "Process terminates anyway": poor engineering practice; Go explicitly rejects this reasoning
  • "Narrow race window": true, but a narrow window is not the same as no bug; the invariant is still violated
  • "Fix callers instead": the registry fix is centralized; fixing callers would require auditing all 3 locations
  • "TOCTOU in fix": the race window shrinks from "forever" to nanoseconds; cached informers bypass the check
  • "HasSynced=false is correct": correct state, but Get() should return an error, not success with a broken object

History

This bug was introduced in commit 12c4811 (@zachsmith1, 2026-03-03, PR #14). The InformerRegistry type was created as a new abstraction during a major refactor that replaced storage fallbacks with fork-native identity propagation. Get() was written to create MultiClusterInformer instances on-demand, passing the registry's context to control their lifetime, but no guard was added to check whether that context was already cancelled before creating a new informer. That omission produces the contract violation described above: Get() returns success for an informer that will never sync.
