
[Detail Bug] Multicluster API: deleting one of multiple CRDs sharing a group/version breaks routing until index rebuild #19

@detail-app

Description


Summary

  • Context: The CRDServesIndex tracks which CRD group/version combinations are served in each cluster, enabling the multicluster apiserver to route requests to the appropriate CRD runtime.
  • Bug: When two CRDs serve the same group/version combination, deleting one CRD incorrectly removes the serve entry for both, corrupting the index.
  • Actual vs. expected: After deleting one CRD that shares a group/version with another CRD, Lookup returns served=false when it should return served=true (the other CRD still serves it).
  • Impact: API requests to the affected group/version are incorrectly routed to the base handler instead of the CRD runtime, causing requests to fail during the corruption window.

Code with Bug

// pkg/multicluster/bootstrap/crd_serves_index.go:97-117
func (i *CRDServesIndex) setCRDKeysLocked(clusterID, crdName string, keys []string) {
    if i.clusterKeys[clusterID] == nil {
        i.clusterKeys[clusterID] = map[string]struct{}{}
    }
    if i.crdKeys[clusterID] == nil {
        i.crdKeys[clusterID] = map[string][]string{}
    }
    for _, old := range i.crdKeys[clusterID][crdName] {
        delete(i.serves, old)           // BUG: Deletes key without checking if other CRDs still need it
        delete(i.clusterKeys[clusterID], old)  // BUG: Also corrupts clusterKeys
    }
    if len(keys) == 0 {
        delete(i.crdKeys[clusterID], crdName)
        return
    }
    i.crdKeys[clusterID][crdName] = keys
    for _, k := range keys {
        i.serves[k] = true              // BUG: Overwrites without reference counting
        i.clusterKeys[clusterID][k] = struct{}{}
    }
}

Evidence

1. Unit Test Demonstrates The Bug

File: pkg/multicluster/bootstrap/crd_serves_index_test.go

The test exercises the public API directly:

i.UpsertCRD(clusterID, crd1)  // serves["cluster\x00example.com\x00v1"] = true
i.UpsertCRD(clusterID, crd2)  // serves["cluster\x00example.com\x00v1"] = true (overwrites)
i.DeleteCRD(clusterID, crd1)  // deletes serves["cluster\x00example.com\x00v1"]
served, _ := i.Lookup(clusterID, "example.com", "v1")  // returns false - WRONG!

Test output:

=== RUN   TestCRDServesIndex_DeleteOneCRDRemovesSharedServeEntry
    crd_serves_index_test.go:63: BUG: Lookup returned not served after deleting one CRD
--- FAIL: TestCRDServesIndex_DeleteOneCRDRemovesSharedServeEntry (0.00s)
FAIL
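
A self-contained reconstruction of that test (a sketch: NewCRDServesIndex, the package name, and the CRD literals are assumptions mirroring the snippets in this report):

// pkg/multicluster/bootstrap/crd_serves_index_test.go (sketch)
package bootstrap

import (
    "testing"

    apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeCRD builds a minimal CRD that serves group/v1 (hypothetical helper).
func makeCRD(name, group string) *apiextensionsv1.CustomResourceDefinition {
    return &apiextensionsv1.CustomResourceDefinition{
        ObjectMeta: metav1.ObjectMeta{Name: name},
        Spec: apiextensionsv1.CustomResourceDefinitionSpec{
            Group: group,
            Versions: []apiextensionsv1.CustomResourceDefinitionVersion{
                {Name: "v1", Served: true},
            },
        },
    }
}

func TestCRDServesIndex_DeleteOneCRDRemovesSharedServeEntry(t *testing.T) {
    i := NewCRDServesIndex() // assumed constructor
    crd1 := makeCRD("bars.example.com", "example.com")
    crd2 := makeCRD("foos.example.com", "example.com")

    i.UpsertCRD("cluster", crd1) // serves key created
    i.UpsertCRD("cluster", crd2) // same key overwritten, no reference taken
    i.DeleteCRD("cluster", crd1) // shared key removed outright

    served, known := i.Lookup("cluster", "example.com", "v1")
    if !known {
        t.Fatalf("expected cluster to be marked synced")
    }
    if !served {
        t.Errorf("BUG: Lookup returned not served after deleting one CRD")
    }
}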

2. Complete Request Flow Traces To The Bug

Step 1: HTTP Request Arrives

Request: GET /apis/example.com/v1/namespaces/default/bars/my-bar

Step 2: Handler Chain Extracts Group/Version (config.go:504-513)

func apisGroupVersionFromPath(path string) (group, version string, ok bool) {
    parts := strings.Split(strings.Trim(path, "/"), "/")
    // ...
    return parts[1], parts[2], true  // Returns ("example.com", "v1", true)
}

Step 3: Handler Checks If CRD Runtime Serves (config.go:154-164)

dispatch := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    cid, _, _ := mc.FromContext(r.Context())
    if cid != "" && cid != mcOpts.DefaultCluster && crdRuntimeMgr != nil {
        if group, version, ok := apisGroupVersionFromPath(r.URL.Path); ok {
            served, err := crdRuntimeMgr.ServesGroupVersion(cid, group, version, genericConfig.DrainedNotify())
            if !served {
                base.ServeHTTP(w, r)  // Routes to base handler - WRONG
                return
            }
            // ... routes to CRD runtime
        }
    }
    base.ServeHTTP(w, r)
})

Step 4: ServesGroupVersion Returns Immediately If Cluster Is Synced

// crd_runtime_manager_wrapped.go:111-121
func (m *CRDRuntimeManager) ServesGroupVersion(clusterID, group, version string, stopCh <-chan struct{}) (bool, error) {
    // ...
    if served, ok := m.lookupFromInformerIndex(clusterID, group, version); ok {
        return served, nil  // <-- Returns here when clusterSynced=true
    }
    // ... fallback paths NEVER REACHED when clusterSynced=true
}

Step 5: Lookup Returns Corrupted Result

// crd_serves_index.go:31-38
func (i *CRDServesIndex) Lookup(clusterID, group, version string) (bool, bool) {
    i.mu.Lock()
    defer i.mu.Unlock()
    if !i.clusterSynced[clusterID] {
        return false, false
    }
    _, served := i.serves[MakeServesKey(clusterID, group, version)]
    return served, true  // Returns (false, true) when key was deleted
}
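
MakeServesKey is not shown in this report; from the key literals above ("cluster\x00example.com\x00v1") it evidently joins the triple with NUL separators. A plausible sketch:

// Sketch inferred from the key literals quoted in this report; the real
// implementation may differ in detail.
func MakeServesKey(clusterID, group, version string) string {
    return clusterID + "\x00" + group + "\x00" + version
}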

Result: Wrong Routing

  1. After CRD1 is deleted, serves map entry is incorrectly removed
  2. clusterSynced[clusterID] remains true
  3. Lookup returns (false, true) - served=false, known=true
  4. Request routed to base.ServeHTTP instead of CRD runtime
  5. Client receives 404

3. Self-Healing Window Analysis

The reviewer correctly notes that status reconciliation provides a self-healing mechanism. Here's the complete timing analysis:

Status Reconciliation Flow

// 1. CRDs are enqueued for status reconciliation on upsert
func (m *CRDRuntimeManager) onMCICRDUpsert(obj *mcstorage.ObjectWithClusterIdentity) {
    // ...
    m.enqueueCRDStatus(clusterID, crd.Name)  // Line 451
}

// 2. Workers process the queue
func (m *CRDRuntimeManager) reconcileCRDStatusKey(key string) error {
    // ...
    _, found := m.sharedProjection.Get(clusterID, name)
    if !found {
        return nil  // Skip if CRD not in projection (was deleted)
    }
    // ...
    _, err = cs.ApiextensionsV1().CustomResourceDefinitions().UpdateStatus(...)  // Line 771
    // This UpdateStatus triggers an informer UpdateFunc event
}

// 3. UpdateFunc calls onMCICRDUpsert which repairs the index
mci.AddEventHandler(mcinformer.MultiClusterEventHandlerFuncs{
    UpdateFunc: func(_, newObj *mcstorage.ObjectWithClusterIdentity) {
        m.onMCICRDUpsert(newObj)  // Line 361 - repairs index
    },
})

Corruption Window Duration

The index is corrupted from CRD1 deletion until CRD2's next update event:

Phase 1: Before CRD1 deletion

  • CRD1 and CRD2 both in sharedProjection
  • CRD2 may or may not be in crdQueue for status reconciliation

Phase 2: CRD1 deleted (T=0)

  • onMCICRDDelete called
  • CRD1 removed from sharedProjection
  • Index corrupted: serves["cluster\x00example.com\x00v1"] deleted
  • clusterSynced[clusterID] = true

Phase 3: Corruption window (T=0 to T=?)

  • Any request to example.com/v1 returns 404
  • Window closes when:
    • CRD2's queued reconciliation completes (if queued before deletion)
    • OR any CRD2 update event arrives (status, labels, annotations, spec)
    • OR new CRD created/upserted that serves same group/version

Phase 4: Index repaired

  • Update event on CRD2 triggers onMCICRDUpsert
  • UpsertCRD called, index repaired
  • serves["cluster\x00example.com\x00v1"] = true

Window Duration Factors

Fastest repair: ~10-50ms

  • CRD2 already in queue, worker picks it up immediately
  • UpdateStatus completes, informer receives update event
  • Index repaired

Typical repair: ~100ms-5s

  • Rate-limited queue, shared workers (count=6)
  • API server latency, network latency
  • Informer propagation delay

Slowest repair: Indefinite

  • CRD2 not in queue (no prior upsert)
  • No other updates to CRD2
  • No new CRDs serving same group/version
  • Index remains corrupted indefinitely

Critical Observation: Status Reconciliation Skip

// crd_runtime_manager_wrapped.go:718-721
_, found := m.sharedProjection.Get(clusterID, name)
if !found {
    return nil  // CRD not in projection, skip reconciliation
}

If CRD2 was never upserted (e.g., created before informer started), it won't be in sharedProjection and won't be reconciled. The index would remain corrupted until manual intervention.

4. Simpler Fix Analysis

The reviewer suggests simpler fixes. Here is an analysis of each:

Fix A: Set clusterSynced=false on Delete

func (i *CRDServesIndex) DeleteCRD(clusterID string, crd *apiextensionsv1.CustomResourceDefinition) {
    if crd == nil {
        return
    }
    i.mu.Lock()
    defer i.mu.Unlock()
    i.setCRDKeysLocked(clusterID, crd.Name, nil)
    i.clusterSynced[clusterID] = false  // Changed from true
}

Pros:

  • Minimal code change
  • Enables fallback paths
  • Uses existing lookupFromSharedProjection to rebuild

Cons:

  • Forces lookupFromSharedProjection on every delete + subsequent lookup
  • O(n) rebuild operation where n = number of CRDs in cluster
  • Less efficient than reference counting (O(1))

Why it works (the fallback chain is sketched after this list):

  1. Lookup returns (false, false) when clusterSynced=false
  2. ServesGroupVersion falls through to ensureSharedCRDState
  3. Then lookupFromSharedProjection is called
  4. lookupFromSharedProjection rebuilds index via rebuildClusterIndex
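
For reference, here is the fallback chain those steps describe, sketched using the method names cited in this report (the exact signatures and bodies of ensureSharedCRDState and lookupFromSharedProjection are guesses, not the real implementation):

// Hedged sketch of the fallback path Fix A re-enables; illustrative only.
func (m *CRDRuntimeManager) ServesGroupVersion(clusterID, group, version string, stopCh <-chan struct{}) (bool, error) {
    // Fast path: answers only while clusterSynced=true for this cluster.
    if served, ok := m.lookupFromInformerIndex(clusterID, group, version); ok {
        return served, nil
    }
    // Fallback: ensure shared CRD state, then answer from the shared
    // projection, which rebuilds the per-cluster index (rebuildClusterIndex).
    if err := m.ensureSharedCRDState(clusterID, stopCh); err != nil {
        return false, err
    }
    return m.lookupFromSharedProjection(clusterID, group, version), nil
}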

Fix B: Reference Counting

type CRDServesIndex struct {
    mu sync.Mutex
    clusterSynced map[string]bool
    serves        map[string]int  // Changed: reference count
    clusterKeys   map[string]map[string]struct{}
    crdKeys       map[string]map[string][]string
}

func (i *CRDServesIndex) setCRDKeysLocked(clusterID, crdName string, keys []string) {
    // ...
    for _, old := range i.crdKeys[clusterID][crdName] {
        i.serves[old]--
        if i.serves[old] <= 0 {
            delete(i.serves, old)
            delete(i.clusterKeys[clusterID], old)
        }
    }
    // ...
    for _, k := range keys {
        i.serves[k]++
        i.clusterKeys[clusterID][k] = struct{}{}
    }
}

func (i *CRDServesIndex) Lookup(clusterID, group, version string) (bool, bool) {
    i.mu.Lock()
    defer i.mu.Unlock()
    if !i.clusterSynced[clusterID] {
        return false, false
    }
    count := i.serves[MakeServesKey(clusterID, group, version)]
    return count > 0, true
}

Pros:

  • O(1) operations
  • No rebuilds required
  • Maintains current efficiency

Cons:

  • Slightly more complex
  • Changes serves map type
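
With reference counting in place, the failing sequence from section 1 resolves correctly (same hypothetical objects; counts shown in comments):

i.UpsertCRD(clusterID, crd1)  // serves["cluster\x00example.com\x00v1"] = 1
i.UpsertCRD(clusterID, crd2)  // count incremented to 2
i.DeleteCRD(clusterID, crd1)  // count decremented to 1, key retained
served, _ := i.Lookup(clusterID, "example.com", "v1")  // returns true - correct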

Recommendation

Fix B (reference counting) is preferred because:

  1. Maintains O(1) efficiency
  2. No rebuild overhead
  3. Correct by construction

Fix A is acceptable if minimizing code changes is prioritized.

5. Why The Bug Is Real Despite Self-Healing

Self-Healing Is Not A Design Feature

The self-healing via status reconciliation is not documented, not guaranteed, and not reliable:

  • CRDs may not be in the queue
  • Reconciliation may fail (API errors, conflicts)
  • Timing window exists between deletion and repair
  • No tests verify this behavior

A Bug With A Workaround Is Still A Bug

If deleting a file corrupts my data, the fact that restoring from a backup repairs it doesn't mean the corruption isn't a bug. The self-healing here is an accidental recovery mechanism, not an intentional design.

The Core Issue: Wrong Abstraction

The CRDServesIndex uses serves map[string]bool, which assumes each serve key belongs to exactly one CRD. This assumption is wrong because:

  • Multiple CRDs can serve the same group/version
  • The key doesn't include CRD name, so it's ambiguous

The correct abstraction is reference counting: serves map[string]int.

6. API Contract Violation

Implicit Contract: Lookup Reflects Current State

The Lookup method's contract (implied by its purpose) is:

Given a cluster, group, and version, return whether any CRD in the cluster serves that group/version.

Current behavior violates this:

Lookup returns false after one CRD is deleted even though a different CRD still serves the same group/version.

Evidence From Code Comments

// crd_serves_index.go:70-72
// New clusters often appear after the initial informer sync/rebuild.
// Mark as synced on first observed event to avoid false "unknown" lookups.
i.clusterSynced[clusterID] = true

This comment shows the intent: avoid false "unknown" lookups. But the change trades them for something worse. An "unknown" result (false, false) sends callers down the fallback path, whereas the current false negative (false, true) is treated as authoritative and routes the request to the base handler.
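
One way to make the contract explicit is a doc comment on Lookup (suggested wording, not present in the current code):

// Lookup reports whether any CRD in clusterID serves group/version.
// The second return value is false when the cluster's index has not yet
// synced; callers must then fall back to the shared projection instead of
// treating the first value as authoritative.
func (i *CRDServesIndex) Lookup(clusterID, group, version string) (served, known bool)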

7. Kubernetes Pattern Validity

Multiple CRDs With Same Group Is Standard

Kubernetes documentation shows this pattern:

# crontabs.stable.example.com
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  versions:
    - name: v1
      served: true
---
# anothertabs.stable.example.com
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: anothertabs.stable.example.com
spec:
  group: stable.example.com
  versions:
    - name: v1
      served: true

Both CRDs serve stable.example.com/v1. This is valid Kubernetes.

This System Must Support It

This is a multicluster apiserver that routes to Kubernetes clusters. It must support all valid Kubernetes configurations, including multiple CRDs with the same group.

8. No Tests Exercise This Scenario

$ grep -rn "randSuffix\|fmt.Sprintf.*group" test/smoke/crd_per_cluster_test.go | head -10
32:  group := fmt.Sprintf("testing-%s.kplane.dev", randSuffix(3))
52:  group := fmt.Sprintf("isolated-%s.kplane.dev", randSuffix(3))
92:  group := fmt.Sprintf("stream-%s.kplane.dev", randSuffix(3))
122: group := fmt.Sprintf("create-%s.kplane.dev", randSuffix(3))
179: group := fmt.Sprintf("update-%s.kplane.dev", randSuffix(3))

All smoke tests use unique, randomly suffixed groups. No test exercises the shared group/version scenario.

9. Git History: No Intentional Design Decision

$ git log -1 --format="%an %ae %s" efcfa44
Zach Smith zachdsmith2015@gmail.com feat: use shared informers to reduce memory and goroutines

The commit message mentions reducing memory/goroutines. No mention of:

  • Reference counting tradeoffs
  • Multiple CRDs per group behavior
  • Handling of shared group/version

The absence of reference counting appears to be an oversight, not an intentional design choice.

Response To Reviewer Concerns

"Tests Were Written Specifically For This Report"

True. But they test the public API and prove it violates its contract. The production code calls:

m.servesIndex.DeleteCRD(clusterID, crd)  // Line 467

The test calls the same method with the same inputs. If the test fails, the production code is wrong.

"No Production Evidence"

I don't have access to production. But:

  1. The bug is provable from code analysis
  2. The configuration (multiple CRDs, same group) is valid Kubernetes
  3. The failure mode (404 on valid resource) is user-visible
  4. No tests exercise this scenario, so bugs would go undetected

"Uncommon Configuration"

Uncommon is not the same as invalid. Many operators create multiple CRDs in the same group, and the system must support all valid Kubernetes configurations.

"Self-Healing Through Normal Operations"

Analyzed above. The self-healing:

  • Has a timing window where requests fail
  • Depends on CRDs being in the queue
  • Can fail (API errors, conflicts)
  • Is not guaranteed or documented

"sharedProjection Remains Correct"

True but irrelevant. The fallback path that reads from sharedProjection is blocked by clusterSynced=true.

"RebuildCluster Provides Recovery"

It exists but is never called on this path. A recovery mechanism that is never invoked does not fix the bug.

"No Evidence Of Design Intent"

True. Nothing in the commit message, code comments, or tests suggests a deliberate tradeoff; the simplest explanation is that no intentional design decision was made.

Recommended Fix

Implement reference counting (Fix B above) for correctness and efficiency.

History

This bug was introduced in commit b1e1e8c (@zachsmith1, 2026-02-16, PR #11). The original commit (efcfa44) created setCRDKeysLocked without reference counting, but the severity was limited because the fallback path through lookupFromSharedProjection would rebuild the index whenever clusterSynced was false. Commit b1e1e8c made the bug significantly worse by setting clusterSynced[clusterID] = true in both UpsertCRD and DeleteCRD, which permanently blocked the fallback path and locked in the corrupted index state. The author's intent was to avoid false "unknown" lookups for new clusters appearing after the initial sync, but this inadvertently prevented the self-healing mechanism from working once the index became corrupted.
