STOR-2962: Add SELinuxMount GA Upgrade Readiness for 5.0 #2010
@jsafrane: This pull request references STOR-2962 which is a valid jira issue.
> What is *the actual API object* is currently open. Ideas:
>
> * A ConfigMap in a shared namespace, such as
>   `openshift-config/selinux-conflicts`. Does KCM have permissions to do so?
The carry patch would just need to add additional RBAC for the selinux-warning-controller: https://github.com/kubernetes/kubernetes/blob/b9b0ff440d5493764532348e0d80abdb7daf47b5/plugin/pkg/auth/authorizer/rbac/bootstrappolicy/controller_policy.go#L592
JoelSpeed left a comment
No major concerns from my perspective here.
I would be interested, though, if you could add an example of how the end user is supposed to observe the warnings. What will the Upgradeable=False condition look like, and how will they know which pods need attention?
Will there be a KCS, linked from the condition, that explains what actions they need to take?
> What is *the actual API object* is currently open. Ideas:
>
> * A ConfigMap in a shared namespace, such as
>   `openshift-config/selinux-conflicts`. Does KCM have permissions to do so?
Who is the consumer of this object? Is it for end users or is this considered to be internal communication between openshift components?
The config map is consumed only by the cluster-storage-operator.
As the number of bad Pods can be large (we have a cluster with 6000 of them), users need to use metrics to list the namespaces and Pods. The upgradeable condition will try to point users to the metric. There will be an alert with a longer, human-friendly description and the name of the metric to check (and maybe a link to the console with the metric, if I find out how to make one).
The question is whether the upgradeable condition should be generic, "there are Pods that could get broken during upgrade to 5.0 / 4.23, please see metric TBD", or specific about the number of Pods, "there are 512 Pods that could get broken during upgrade to 5.0 / 4.23, please see metric TBD". If we want the actual number, we need to choose how often KCM will update it. Frequent updates will load the cluster unnecessarily; less frequent updates may show the user a stale number.
I'd start with just a boolean flag instead of the actual number.
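To make the boolean-flag option concrete, here is a minimal sketch of how the consumer could turn the ConfigMap into an upgradeable condition. The ConfigMap key `conflictsDetected` and the reason strings are hypothetical (the ConfigMap content is not decided here), the metric name stays "TBD" as above, and Python is used only for illustration; the real operator is written in Go.

```python
def upgradeable_condition(configmap_data):
    """Build an Upgradeable condition (as a plain dict) from ConfigMap data.

    `conflictsDetected` is a hypothetical key; the real ConfigMap content
    is still an open question.
    """
    # In this sketch a missing key means "no conflicts known".
    flag = configmap_data.get("conflictsDetected", "false").strip().lower()
    if flag == "true":
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "SELinuxConflictsDetected",  # hypothetical reason
            "message": (
                "There are Pods that could get broken during upgrade to "
                "5.0 / 4.23, please see metric TBD"
            ),
        }
    return {
        "type": "Upgradeable",
        "status": "True",
        "reason": "AsExpected",  # hypothetical reason
        "message": "",
    }
```

With a boolean, the condition only changes when the flag flips, so there is no update-frequency trade-off to tune.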
> The config map is consumed only by the cluster-storage-operator.

Given this is completely internal, I think a ConfigMap makes sense as a temporary way to coordinate between two components.

> The upgradeable condition will try to point the users to the metric. There will be an alert with a longer human friendly description and name of the metric to check (and maybe a link to the console with the metric, if I find how to make it).

In a cluster without metrics, will there be an alternative way for users to identify the pods? Is there some CLI command we could recommend via a KCS?

> should it be specific about the nr of Pods

Perhaps you could update the message based on a range? E.g. "there are approximately 500".
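One way to read the range suggestion, as a rough sketch: round the count down to one significant digit, so the message only changes when the leading digit changes. The helper names and the message wording are made up for illustration, and the metric name is left as "TBD".

```python
def approximate_count(n):
    """Round n down to one significant digit, e.g. 512 -> 500, 99 -> 90."""
    if n < 10:
        return n
    magnitude = 10 ** (len(str(n)) - 1)
    return (n // magnitude) * magnitude


def conflict_message(n):
    """Return a condition message for n affected Pods (empty if none)."""
    if n <= 0:
        return ""
    if n < 10:
        count = str(n)
    else:
        count = f"approximately {approximate_count(n)}"
    return (
        f"There are {count} Pods that could get broken during upgrade, "
        "please see metric TBD"
    )
```

For example, any count from 500 to 599 produces the same "approximately 500" message, so the message would not need to be rewritten on every small change.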
> In a cluster without metrics, will there be an alternative way for users to identify the pods? Is there some CLI command we could recommend via a KCS?

I added a note to the KEP. Indeed, the KCS needs to include steps for finding the Pods in a cluster without Prometheus. Scraping KCM metrics with curl + grep could be enough; however, we need to document how to get a token and how to find all KCMs.
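As a rough illustration of the "curl + grep" idea, the sketch below extracts the affected Pods from metrics text that has already been fetched from a KCM. The metric name and the `pod1_*` / `pod2_*` label names are assumptions and must be checked against the real KCM `/metrics` output; the sample text is made up. Getting a token and finding all KCMs is out of scope here.

```python
import re

# Assumed metric name; verify against the actual KCM /metrics output.
METRIC = "selinux_warning_controller_selinux_volume_conflict"

# Matches name="value" pairs in the Prometheus text exposition format.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def affected_pods(metrics_text, metric=METRIC):
    """Return a sorted list of unique (namespace, pod) pairs from the metric."""
    pods = set()
    for line in metrics_text.splitlines():
        # Skip comments and unrelated metrics.
        if not line.startswith(metric + "{"):
            continue
        labels = dict(LABEL_RE.findall(line))
        # Assumed label names: each sample describes a pair of Pods.
        for prefix in ("pod1", "pod2"):
            namespace = labels.get(prefix + "_namespace")
            name = labels.get(prefix + "_name")
            if namespace and name:
                pods.add((namespace, name))
    return sorted(pods)


# Made-up sample, only to show the expected shape of the input.
SAMPLE = """\
process_start_time_seconds 1.7e+09
selinux_warning_controller_selinux_volume_conflict{pod1_name="app-a",pod1_namespace="ns1",pod2_name="app-b",pod2_namespace="ns2"} 1
selinux_warning_controller_selinux_volume_conflict{pod1_name="app-b",pod1_namespace="ns2",pod2_name="app-a",pod2_namespace="ns1"} 1
"""

for namespace, name in affected_pods(SAMPLE):
    print(f"{namespace}/{name}")
```

Because the same Pod can appear in several samples, the sketch de-duplicates the pairs before printing them.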
> Perhaps you could update the message based on a range? E.g. there are approximately 500
I left it as an implementation detail :-).
- KCM-O is not a viable approach, it does not run in HyperShift.
- Add KCS about details, so we can link it from the alert(s).
- The API object is indeed a ConfigMap.
- Add a note on how often it will be updated + what's the content.
- Add more details about the proposed KCS, especially it must have instructions how to get the affected Pods in a cluster without Prometheus.
> #### Single-node Deployments or MicroShift
>
> No special considerations are needed for SNO.
There may be some special things we need to do for single-node deployments. I do not think the StoragePerformantPolicy admission hook was enabled in those environments.
If this is about openshift/cluster-storage-operator#664 (comment), that's MicroShift, and it's intentionally omitted in a paragraph below.
@jsafrane: all tests passed!
This enhancement prepares OpenShift 5.0 for the `SELinuxMount` feature going GA in Kubernetes 1.37 / OpenShift 5.1. `SELinuxMount` introduces a breaking change, and we'll need to mark a 5.0 cluster un-upgradeable until the cluster admin fixes their workloads or opts out of `SELinuxMount`. This enhancement proposes how to detect such workloads and how to pass the information from the component that knows it (a `<carry>` patch in kube-controller-manager) to a component that marks the cluster un-upgradeable (cluster-storage-operator).

See metric `cluster:selinux_warning_controller_selinux_volume_conflict:count` in telemetry for the number of affected clusters. It's a very low number (not commenting publicly ;-)). Most clusters will upgrade just fine.

There are some open questions about the actual API used to pass the info. Just circulating the idea of a `<carry>` patch first, before we dive into implementation details.

Proof of concept of the `<carry>` patch, using a ConfigMap in the `openshift-config` namespace as "the API object": openshift/kubernetes#2671 (the actual API object is for discussion).