fix(controller): reset stale REST mapper instead of restarting pod by IvanHunters · Pull Request #15 · cozystack/kilo-clustermesh-operator

IvanHunters · 2026-05-28T00:02:08Z

Summary

A freshly bootstrapped tenant cluster installs its Peer CRD only after
the operator has reconciled the parent ClusterMesh CR for the first
time. The reconcile's List(PeerList{}) against that target then hits
NoKindMatchError, controller-runtime's REST mapper caches the negative
discovery entry for the lifetime of cluster.Cluster, and every
subsequent peer push to that target fails forever — until the operator
pod is restarted by hand. This is a permanent stall, not a transient
error: requeue-with-backoff never refreshes discovery.

Observed in the wild: a freshly-recreated tenant mesh1 deadlocked at
ceph.peers=0 for ~13 minutes while a sibling tenant mesh2, whose CRD
landed before the operator's first reconcile, kept reconciling
successfully. Both tenants shared the same operator pod.

Fix

Replace the every-target REST mapper with a resettable wrapper, and
call Reset() on it from the reconciler when a NoKindMatchError
surfaces. The next reconcile re-discovers the missing kind over one
RPC; no pod restart, no leader-lease churn.

Why a wrapper

cluster.New builds its mapper through
apiutil.NewDynamicRESTMapper, which returns an unexported
*apiutil.mapper. That type does not implement
meta.ResettableRESTMapper — the upstream interface guard does not
exist on any controller-runtime release through v0.24.0. Without a
wrapper, mapper.(meta.ResettableRESTMapper) inside
restart.RefreshMapperOnNoMatch would silently fail the type
assertion and Reset would never fire — the previous draft of this PR
did exactly that and shipped as a placebo, which the new test-guard
catches.

multicluster.resettableDynamicMapper is a thin proxy:

delegates every meta.RESTMapper method to a current dynamic
mapper under an RWMutex,
implements Reset() by constructing a fresh
apiutil.NewDynamicRESTMapper and atomically swapping the
underlying mapper. NewDynamicRESTMapper is lazy about discovery,
so the rebuild is cheap (no RPC at construction). The next
List() against an uncached kind triggers one discovery
round-trip against the target apiserver,
preserves the previous mapper on rebuild failure so callers still
have a working mapper for kinds already in cache; the next Reset
retries.

The wrapper is plugged in via cluster.Options.MapperProvider in
multicluster.Build, so every remote cluster.Cluster the operator
owns ends up with a resettable mapper. The reconciler reaches all of
them through r.Registry.Mappers() and resets every mapper on a
NoKindMatchError: the wrapped error chain does not carry the
specific failed-target name at the Reconcile boundary, and
Reset() is cheap enough that fan-out cache invalidation is an
acceptable tradeoff.

Why not the original self-cancel approach

Earlier drafts of this PR ended the manager's root context so kubelet
would restart the pod. It worked end-to-end (verified in the wild),
but the cure was disproportionate to the disease: every NoMatch event
dropped the leader lease, killed every in-flight reconcile across
every tenant, and after a handful of consecutive stale-CRD events the
pod inflated to CrashLoopBackOff with a multi-minute backoff. Mapper
reset is surgical: it scopes recovery to discovery-cache invalidation
and leaves the rest of the operator running.

Other changes

directCluster.GetRESTMapper() returns nil so envtest-driven
integration tests stay safe; ClusterRegistry.Mappers() skips nil
entries.
Cancel field removed from ClusterMeshReconciler. The cancel
signal stays wired through ChangeWatcher where the pod-restart
shape is correct (kubeconfig fingerprint drift legitimately
requires rebuilding the registry).

Test plan

go test ./... (excluding integration which requires envtest
assets) passes, including:
- Three new tests on resettableDynamicMapper:
  TestResettableDynamicMapperImplementsResettable is the
  load-bearing regression guard — if the wrapper ever stops
  satisfying meta.ResettableRESTMapper, CI fails immediately;
  TestRawDynamicMapperIsNotResettable documents the upstream
  gap and will flip when controller-runtime closes it, telling us
  to delete the wrapper; TestResettableDynamicMapperResetReplacesUnderlying
  proves Reset is not a stub.
- Seven unit cases on RefreshMapperOnNoMatch (nil err, nil
  mapper, unrelated err, plain NoKindMatchError, double-wrapped
  NoKindMatchError, non-resettable mapper, nil logger).
go build ./... clean.
golangci-lint run ./... zero issues.
Manual reproduction on a Cozystack hosting cluster: tenant
whose Peer CRD lands after first reconcile should recover on the
next reconcile tick with no operator pod restart and no leader-lease
churn observed.

Self-cancel helper that fires the manager's context when an error chain carries a NoKindMatchError. Mirrors ChangeWatcher's restart-on-drift pattern so the operator recovers from stale remote-cluster discovery without operator intervention. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

A target cluster's REST mapper caches discovery results for the lifetime of cluster.Cluster. When a freshly bootstrapped tenant installs its Peer CRD only after the operator's first reconcile of its ClusterMesh CR, the mapper holds a NoKindMatchError forever and every subsequent peer push fails until the pod restarts. Plumb the manager's cancel through ClusterMeshReconciler and invoke TriggerOnStaleDiscovery on every error return so kubelet rebuilds the registry against current discovery. Same recovery shape as the existing ChangeWatcher fingerprint-drift restart. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

coderabbitai · 2026-05-28T00:02:20Z

📝 Walkthrough

Walkthrough

The PR adds automatic pod restart recovery when the Kubernetes REST mapper becomes stale. A new TriggerOnStaleDiscovery helper detects meta.NoKindMatchError, logs context, and invokes a cancel function to restart the pod. The reconciler is refactored to call this helper on errors, and the cancel function is wired from main.go through the dependency injection layer.

Changes

Pod Restart on Stale Discovery

Layer / File(s)	Summary
Stale discovery detection helper `internal/restart/trigger.go`, `internal/restart/trigger_test.go`	New `TriggerOnStaleDiscovery` function detects `meta.NoKindMatchError`, optionally logs the `GroupKind`, and invokes the provided cancel function. Tests verify behavior across nil inputs, plain and wrapped errors, and logger tolerance.
Reconciler stale discovery integration `internal/controller/clustermesh_controller.go`	`ClusterMeshReconciler` adds `Cancel context.CancelFunc` field, imports restart package, and refactors `Reconcile` to invoke `TriggerOnStaleDiscovery` on errors. Reconciliation logic is extracted to a `reconcile` helper with simplified error propagation.
Cancel function wiring `cmd/main.go`	The `run()` function passes the operator shutdown cancel function to `wireReconciler`, which assigns it to the reconciler's `Cancel` field.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

Arsolitt

Poem

🐰 A pod restart on stale discovery appears,
When the mapper grows dusty through kubernetes tears,
The cancel bell rings, and the pod takes a bow,
Refreshed and reborn—a fresh start is now! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately reflects the main change: adding a stale REST mapper detection mechanism that triggers pod restart instead of manual intervention.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/self-restart-on-stale-discovery-cache

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a self-restart mechanism to handle stale discovery caches (specifically NoKindMatchError) on remote clusters by propagating the context's CancelFunc to the ClusterMeshReconciler. A new restart package is added with helper functions and unit tests to trigger this cancellation. The review feedback suggests checking the return value of TriggerOnStaleDiscovery in the reconciliation loop to return ctrl.Result{}, nil when a restart is triggered, preventing false-positive error logs and redundant requeues in controller-runtime during pod shutdown.

gemini-code-assist · 2026-05-28T00:03:04Z

+	restart.TriggerOnStaleDiscovery(err, r.Cancel, log)
+
+	return ctrl.Result{}, err


While TriggerOnStaleDiscovery returns a boolean indicating whether a self-restart was triggered, the return value is currently ignored here.

If a self-restart is triggered, the context is cancelled and the manager is shutting down. Returning the raw NoKindMatchError to controller-runtime will cause it to log a false-positive reconciler error and attempt to requeue the request, which is redundant and can trigger false alarms in log-monitoring systems.

Instead, we should check the return value of TriggerOnStaleDiscovery and return ctrl.Result{}, nil if a restart was triggered.

if restart.TriggerOnStaleDiscovery(err, r.Cancel, log) { return ctrl.Result{}, nil } return ctrl.Result{}, err

…n stale discovery When a reconcile error carries a NoKindMatchError, reset every remote cluster's REST mapper so the next reconcile re-discovers the missing kind. The previous self-cancel approach killed the leader lease and risked CrashLoopBackOff under repeated stale-CRD events. Mapper reset is scoped to cache invalidation, costs one discovery RPC on the next List, and keeps the rest of the operator running. - multicluster.ClusterRegistry gains Mappers() returning every remote cluster's REST mapper; directCluster.GetRESTMapper() returns nil so envtest-driven integration tests stay safe. - restart.RefreshMapperOnNoMatch replaces TriggerOnStaleDiscovery; accepts a single mapper, type-asserts to meta.ResettableRESTMapper. - ClusterMeshReconciler.Reconcile loops over Mappers() on every return. Cancel field removed from the reconciler; ChangeWatcher still cancels on kubeconfig drift where that is the correct shape. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

…fire controller-runtime's apiutil.NewDynamicRESTMapper returns an unexported *mapper that has no Reset() method, so the meta ResettableRESTMapper type assertion inside restart.RefreshMapperOnNoMatch was failing in production and the recovery path was silently no-opping. The previous commit shipped a placebo: unit tests passed against a hand-rolled spy but the prod mapper never got reset, so the original mesh1 deadlock would not self-heal as advertised. Introduce resettableDynamicMapper, a thin proxy around apiutil.NewDynamicRESTMapper that implements meta.ResettableRESTMapper. Reset() atomically swaps the underlying mapper for a freshly constructed one; NewDynamicRESTMapper is lazy about discovery so the rebuild is cheap. On rebuild failure the old mapper is preserved so callers still have a working mapper for kinds already in cache. The wrapper is wired into multicluster.Build via cluster.Options.MapperProvider so every remote cluster.Cluster ends up with a resettable mapper. Tests: - TestResettableDynamicMapperImplementsResettable: load-bearing invariant — guards against the regression that just happened. - TestRawDynamicMapperIsNotResettable: documents the upstream gap; flips green and tells us to delete the wrapper if controller-runtime ever adds Reset to its dynamic mapper. - TestResettableDynamicMapperResetReplacesUnderlying: Reset is not a no-op stub. Also: log message in restart.RefreshMapperOnNoMatch said "target cluster"; on the reconciler's fan-out path it is one of N remote clusters, not a single target. Reword to "remote-cluster". Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

Copilot

Pull request overview

Replaces the prior "self-cancel the manager and let kubelet restart the pod" recovery from NoKindMatchError on remote clusters with a surgical mapper-reset: every remote cluster.Cluster is now built with a resettable wrapper around apiutil.NewDynamicRESTMapper, and the reconciler invokes Reset() on every registered mapper after a reconcile that surfaces a NoKindMatchError. This avoids leader-lease churn and CrashLoopBackOff caused by repeated pod restarts when tenant CRDs land after the operator's first reconcile.

Changes:

Add multicluster.resettableDynamicMapper, a meta.ResettableRESTMapper wrapping apiutil.NewDynamicRESTMapper, wired in via cluster.Options.MapperProvider in multicluster.Build.
Add restart.RefreshMapperOnNoMatch plus ClusterRegistry.Mappers(), and call them from the refactored ClusterMeshReconciler.Reconcile so a NoKindMatchError triggers a discovery-cache invalidation instead of a pod restart.
Add unit coverage for the wrapper (incl. a regression guard on meta.ResettableRESTMapper satisfaction and an upstream-gap marker) and for the trigger helper across nil/wrapped/non-resettable inputs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
internal/restart/trigger.go	New helper that detects `NoKindMatchError` in an error chain and resets a `meta.ResettableRESTMapper`.
internal/restart/trigger_test.go	Unit tests covering nil/unrelated/plain/wrapped/non-resettable/nil-logger paths.
internal/multicluster/mapper.go	Resettable wrapper around `apiutil.NewDynamicRESTMapper` with atomic swap on `Reset()`.
internal/multicluster/mapper_test.go	Tests for the wrapper, including regression guard for `meta.ResettableRESTMapper`.
internal/multicluster/registry.go	Wires the wrapper into `cluster.New`, adds `Mappers()`, and makes `directCluster.GetRESTMapper` return nil.
internal/controller/clustermesh_controller.go	Splits `Reconcile` into inner `reconcile`; on return, fans `RefreshMapperOnNoMatch` across all registered mappers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+// handful of restarts), lets the manager keep its leader lease, and
+// scopes recovery to the one cluster whose CRD state actually drifted.
+// The next reconcile picks up the fresh mapping via Discovery and the
+// peer push succeeds without further intervention.
+//
+// mapper may be nil — the call is then a no-op. If the mapper does not
+// implement meta.ResettableRESTMapper (an in-memory test fake, say) the
+// call also no-ops; that is acceptable because the production code path
+// always builds clusters through controller-runtime's dynamic mapper.
+// log may be nil — logging is then skipped.
+// Returns true when Reset() was actually invoked, so callers can avoid
+// double-emitting the recovery event.


+func (r *resettableDynamicMapper) Reset() {
+	fresh, err := apiutil.NewDynamicRESTMapper(r.cfg, r.httpClient)
+	if err != nil {
+		return
+	}
+
+	r.mu.Lock()
+	r.mapper = fresh
+	r.mu.Unlock()
+}


…17) The reconciler watches only ClusterMesh CRs, not the Nodes of remote clusters. If the first reconcile runs against a source cluster whose apiserver answers but whose kubelet has not joined yet (or whose kubeconfig secret is not in the registry at all), reconcile completes without error: ensureNodeEndpoints sees zero nodes, pushPeersToTargets emits an empty desired peer set, no useful state changes. controller runtime then has nothing to requeue on, and the next attempt waits for an external event on the CR — which can take many minutes (in the wild we saw a 17-minute mesh-up gap on a freshly bootstrapped tenant whose VM landed a few minutes after the operator's first reconcile). Flag the run as "incomplete" when any source is missing from the registry or returned zero nodes, and translate that into a RequeueAfter at the Reconcile boundary. The interval is a short constant (30s) so freshly bootstrapped tenants converge promptly without waiting on a stray watch event. Errors are unaffected: controller-runtime keeps its own exponential backoff for those. This is the source-side dual of the PR #15 fix (which addressed the target-side NoMatchError on the Peer CRD via mapper Reset). Together they close the two halves of the freshly-bootstrapped-tenant race. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

IvanHunters added 2 commits May 28, 2026 02:54

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

IvanHunters changed the title ~~fix(controller): self-restart on stale remote-cluster discovery cache~~ fix(controller): reset stale REST mapper instead of restarting pod May 28, 2026

IvanHunters requested a review from Copilot May 28, 2026 00:46

Copilot started reviewing on behalf of IvanHunters May 28, 2026 00:46 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Arsolitt (Arsolitt) approved these changes May 28, 2026

View reviewed changes

IvanHunters merged commit 69faf59 into main May 28, 2026
10 checks passed

IvanHunters deleted the fix/self-restart-on-stale-discovery-cache branch May 28, 2026 07:01

IvanHunters mentioned this pull request May 28, 2026

fix(controller): requeue while source cluster is still bootstrapping #17

Merged

5 tasks

IvanHunters mentioned this pull request May 28, 2026

fix(controller,validation): classify node skips, requeue only when recovery is plausible #18

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(controller): reset stale REST mapper instead of restarting pod#15

fix(controller): reset stale REST mapper instead of restarting pod#15
IvanHunters merged 4 commits into
mainfrom
fix/self-restart-on-stale-discovery-cache

IvanHunters commented May 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		restart.TriggerOnStaleDiscovery(err, r.Cancel, log)

		return ctrl.Result{}, err

Conversation

IvanHunters commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Why a wrapper

Why not the original self-cancel approach

Other changes

Test plan

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

IvanHunters commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading