Added ValidatingAdmissionPolicy check to find policies which block cluster upgrades #180
Open
ericsmith-do wants to merge 1 commit into
Overview
DOKS clusters can get stuck mid-upgrade when a `ValidatingAdmissionPolicy` denies the operations our reconciler needs to roll the worker nodes. We hit this recently when a cluster sat in `reconcile pending` for ages with nothing useful in the reconciler logs, and it turned out a `safe-upgrades.gateway.networking.k8s.io` policy was quietly rejecting the reconciler's requests. Someone had to manually poke around the cluster's CRDs to figure that out.

clusterlint already catches the equivalent problem for admission webhooks, but it had no idea ValidatingAdmissionPolicies existed. This PR adds a new check so we catch this class of problem automatically instead of debugging it by hand.
The scoping is intentionally conservative, matching the webhook check. The literal policy from the incident targets Gateway API resources cluster-wide, so it wouldn't be caught as-is. Happy to broaden this if we'd rather err toward more findings.
Testing
Unit tests cover the meta/registration plus every filter branch. I also ran it against a real cluster to confirm the fetch wiring and RBAC, not just the logic:
- `kind create cluster --image kindest/node:v1.30.0`
- `clusterlint run -c validating-admission-policy` reported the error as expected.
- `Warn` → no finding
- `failurePolicy: Ignore` → no finding
- batch/jobs → no finding

So only the genuinely risky shape gets flagged, and breaking any single condition silences it.
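
For reviewers who want the filter logic at a glance, here is a minimal sketch of the "every condition must hold" shape that the manual tests above exercise. It is not the code in this PR. The `Policy` and `Rule` types, the `blocksUpgrade` name, and in particular the set of "risky" resources (core-group `pods` and `nodes`) are assumptions made for illustration; the real check works on the Kubernetes API types and may scope resources differently.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified stand-in for one match rule of a policy.
type Rule struct {
	APIGroups []string
	Resources []string
}

// Policy is a hypothetical flattened view of a ValidatingAdmissionPolicy
// together with the validation actions of its binding.
type Policy struct {
	Name              string
	FailurePolicy     string   // "Fail" or "Ignore"
	ValidationActions []string // e.g. "Deny", "Warn", "Audit"
	Rules             []Rule
}

// matches reports whether list contains want or the "*" wildcard.
func matches(list []string, want string) bool {
	for _, v := range list {
		if v == want || v == "*" {
			return true
		}
	}
	return false
}

// hasAction reports whether the exact action is present (no wildcard).
func hasAction(actions []string, want string) bool {
	for _, a := range actions {
		if a == want {
			return true
		}
	}
	return false
}

// blocksUpgrade returns true only when all three conditions hold:
// failurePolicy is Fail, the binding denies, and at least one rule
// targets a resource assumed here to matter for rolling nodes.
func blocksUpgrade(p Policy) bool {
	if p.FailurePolicy != "Fail" {
		return false
	}
	if !hasAction(p.ValidationActions, "Deny") {
		return false
	}
	for _, r := range p.Rules {
		// Assumption: core-group pods and nodes are the risky targets.
		if matches(r.APIGroups, "") &&
			(matches(r.Resources, "pods") || matches(r.Resources, "nodes")) {
			return true
		}
	}
	return false
}

func main() {
	risky := Policy{
		Name:              "deny-pods",
		FailurePolicy:     "Fail",
		ValidationActions: []string{"Deny"},
		Rules:             []Rule{{APIGroups: []string{""}, Resources: []string{"pods"}}},
	}
	jobsOnly := risky
	jobsOnly.Name = "deny-jobs"
	jobsOnly.Rules = []Rule{{APIGroups: []string{"batch"}, Resources: []string{"jobs"}}}

	for _, p := range []Policy{risky, jobsOnly} {
		fmt.Printf("%s flagged: %v\n", p.Name, blocksUpgrade(p))
	}
}
```

Under these assumptions, flipping any one field (action to `Warn`, `failurePolicy` to `Ignore`, or the rule to batch/jobs) makes `blocksUpgrade` return false, which mirrors the "breaking any single condition silences it" behavior described above.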