GEP-0041: SLO monitoring proposal #42
Conversation
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
|
|
Thank you @etiennnr. The document lacks some depth on how you would do it. |
rfranzke
left a comment
There was a problem hiding this comment.
For me, the main question is why this is proposed as an extension. Currently, all observability-related features live directly in gardener/gardener, even including the metering rules. What justifies moving this out now? What criteria or decision framework should guide whether something observability-related remains in-tree or becomes an extension? I'm having difficulty seeing a clear and consistent direction for monitoring-related functionality.
Configurable SLOs
Do you have examples/use-cases for how/when users would use different values here?
|
/kind enhancement |
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
|
@vlerenc @rfranzke @ScheererJ @timebertt Please note that I added some more details today (sorry for being last minute, but I also didn't receive much feedback until very recently). Please feel free to add comments, which I will try to answer in advance; otherwise we can discuss it in next week's meeting. |
| ### Scalability | ||
|
|
||
| The metrics used to calculate SLOs are scraped by two Prometheus instances: the `garden-prometheus` in the runtime cluster and a dedicated `shoot-prometheus` in each shoot cluster. In order to be able to scale across a landscape with a large number of shoots, we need to make sure that the `slo-prometheus` doesn't do too much processing. | ||
|
|
||
| Hence, in order to be able to scale horizontally, the SLI processing related to metrics in the shoot clusters will be done in the `shoot-prometheus`. Then the `slo-prometheus` in the runtime cluster is only responsible for aggregating those precomputed SLIs at the shoot level (through the `aggregate-prometheus` in the seed) and calculating the overall SLOs. | ||
|
|
||
| For metrics coming from the `garden-prometheus`, since they are not expected to be as numerous as the ones coming from the shoots, we can afford to do the SLI processing directly in the `slo-prometheus`, in order to have the smallest possible impact on the existing monitoring systems. |
There was a problem hiding this comment.
Indeed, we are now introducing a future bottleneck. So far, we scale infinitely through our seeds, but by collecting data in one place, we create said future bottleneck.
Then again, I understand the urge to have this data in one place and monitor possible service degradations across all infrastructures/seeds in one place (but still with means to slice and dice, to learn of correlations such as, e.g., noticing more issues on a particular cloud provider, but not all).
There was a problem hiding this comment.
This one here @etiennnr (mentioned by @timebertt during the steering meeting).
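To make the two-tier design from the Scalability excerpt concrete, here is a minimal Python sketch of the aggregation step only: each shoot ships one precomputed SLI value, and the central instance merely counts how many shoots meet a threshold, optionally sliced by a label. The record layout, the `provider` label, and the "share of shoots meeting the threshold" definition of the overall figure are assumptions for illustration; the proposal does not specify how the overall SLO is derived.

```python
from collections import defaultdict

# Hypothetical precomputed per-shoot SLIs (percent), as they might arrive via federation.
SHOOT_SLIS = [
    {"shoot": "shoot-a", "provider": "aws", "sli": 99.95},
    {"shoot": "shoot-b", "provider": "aws", "sli": 99.20},
    {"shoot": "shoot-c", "provider": "gcp", "sli": 100.0},
    {"shoot": "shoot-d", "provider": "gcp", "sli": 99.90},
]


def compliance(samples, threshold):
    """Percentage of shoots whose SLI meets the threshold (100% if there are no shoots)."""
    if not samples:
        return 100.0
    ok = sum(1 for s in samples if s["sli"] >= threshold)
    return 100.0 * ok / len(samples)


def compliance_by(samples, label, threshold):
    """Same figure, sliced by an arbitrary label such as the cloud provider."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[label]].append(s)
    return {key: compliance(group, threshold) for key, group in groups.items()}


overall = compliance(SHOOT_SLIS, threshold=99.9)
per_provider = compliance_by(SHOOT_SLIS, "provider", threshold=99.9)
```

The central step touches one number per shoot, which is the property the excerpt relies on for scaling. It does not remove the single collection point raised in the comment above.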
| - Notes: | ||
| - In addition to what's above, we should also add another probe from the shoot to the external endpoint of the kube-apiserver. This way, it should effectively filter out problems from the internet / external to us and focus on problems between the seed's load balancer and the kube-apiserver. | ||
|
|
||
| ### kube-apiserver latency |
There was a problem hiding this comment.
Latency, even more than availability, may be negatively impacted by user workload/controllers.
How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?
There was a problem hiding this comment.
SLO metrics are mainly there to reflect (as much as possible) how healthy our systems are from the customer's perspective (in other words, how happy customers are, purely from a reliability standpoint). They are not meant for troubleshooting; that's what the other metrics and dashboards are for. In our case, since both Gardener operators and the customer have a role to play in latency, it's not going to be a perfect metric.
As said at the top, we are not aiming for perfection in the initial SLOs since there are still a lot of unknowns about what's going on in the real world. However, even if the high latency is the customer's fault, I believe it's a good thing that we know, so we can let them know.
| - Notes: | ||
| - The `shoot:apiserver_latency:percentage` is a metric created from a Prometheus rule, defined [here](https://github.com/gardener/gardener/blob/71650f56a7bfa555bfc5a09a1b1f97439a4b3d40/pkg/component/kubernetes/apiserver/prometheusrule.go#L181-L184). | ||
|
|
||
| ### kube-apiserver Error rate |
There was a problem hiding this comment.
Error rates, even more than availability, may be negatively impacted by user workload/controllers.
How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?
There was a problem hiding this comment.
Same answer as #42 (comment). Not perfect, but good enough to have an overview of the cluster's health.
| Notes: | ||
| - N/A | ||
|
|
||
| ### Machine creation latency |
There was a problem hiding this comment.
Highly dependent on the cloud provider.
There was a problem hiding this comment.
Indeed. However, even then, I guess that any cloud provider should normally provision within that 10-minute delay? That being said, that "normal delay" would need to be configurable, of course. This metric is already "per provider" since it's calculated per shoot.
| - We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket` | ||
| - confirm with MCM experts that the `Pending` state only happens during machine creation. | ||
|
|
||
| ### Node general availability |
There was a problem hiding this comment.
Highly dependent on the user workload.
| - **Reusing existing Prometheus instances**: Because calculating SLOs carries a high computational overhead, we don't think it is a good idea to reuse the existing garden-prometheus, since this could potentially impact other metrics. Also, since we need to keep the metrics for a longer period to calculate SLOs (longer than the time window chosen for the SLOs), reusing the existing instances with extended retention could lead to resource contention. Hence, isolating the SLO processing ensures stability and separation of concerns. | ||
| - **Implementing SLOs as part of Gardener core**: This functionality also adds computational overhead, both for the shoot Prometheus and the aggregate Prometheus, so we believe this should be opt-in for Gardener operators that want to use it. |
There was a problem hiding this comment.
Can you please share some data on how much we handle in the garden-prometheus? To me it would be more natural, to use the same instance and not introduce a new one for every different task (resulting in a different endpoint as well). Also, the garden one, so I thought, isn't doing terribly much today, is it?
There was a problem hiding this comment.
I'm not an expert on how heavily this Prometheus is used, but this is actually something that was proposed/reviewed by monitoring colleagues. The garden Prometheus already federates metrics from seeds and from the garden-apiserver for every single shoot in its landscape, so I'd guess it is quite busy already.
But, I'd say that data retention period is probably the main concern.
ScheererJ
left a comment
There was a problem hiding this comment.
Thanks for creating the proposal.
| - **SLO-based alerting**: Since we have the data to calculate SLO violations and burn rates (SRE best practice), we should also provide, as part of the extension, the capability to configure an Alertmanager based on those SLOs. Again, this should be configurable to fit the needs of each Gardener operator. | ||
| - **Monitoring infrastructure**: The extension should provide the necessary monitoring infrastructure to collect, store, and visualize SLO-related metrics. This includes Prometheus rules for SLI calculation, Perses dashboards for visualization, Prometheus alerts for SLO violations, Alertmanager to manage those alerts, etc. | ||
|
|
||
| The extension builds on the existing monitoring infrastructure (Prometheus operator, Perses operators, plutono annotations, ...), using a dedicated Prometheus instance in the runtime cluster to collect and aggregate SLO-specific metrics with minimal impact on the existing monitoring systems. |
There was a problem hiding this comment.
I am a bit confused as you mention perses and plutono. Does this make any difference? Do we rely on plutono features? Would the eventual migration to perses cause effort?
There was a problem hiding this comment.
I synced with @rickardsjp and he told me that Perses should essentially be a drop-in replacement for Plutono (apart from a few minor caveats).
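The "SLO-based alerting" bullet quoted above mentions burn rates without defining them. As a minimal sketch of the usual definition (observed error ratio divided by the error budget, i.e. `1 - SLO target`), the following Python shows the arithmetic and a two-window alert condition. The function names, the example target, and the `factor` value are hypothetical; the proposal does not fix any of them.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is consumed; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_alert(long_window_errors: float, short_window_errors: float,
                 slo_target: float, factor: float) -> bool:
    """Alert only if both the long and the short window burn faster than `factor`.

    Requiring both windows avoids alerting on an incident that has already ended.
    """
    return (burn_rate(long_window_errors, slo_target) >= factor
            and burn_rate(short_window_errors, slo_target) >= factor)


# Hypothetical example: a 99.9% target with 0.5% of requests failing.
rate = burn_rate(0.005, 0.999)
firing = should_alert(0.02, 0.03, slo_target=0.999, factor=10.0)
recovered = should_alert(0.02, 0.001, slo_target=0.999, factor=10.0)
```

In the extension itself this would presumably be expressed as Prometheus alerting rules rather than Python; the sketch only shows the calculation such rules would encode.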
| - SLI implementation: | ||
|
|
||
| ```promql | ||
| avg_over_time( | ||
| max( | ||
| probe_success{instance="https://api.internal_domain/healthz", type="seed"} | ||
| OR | ||
| probe_success{instance="https://kubernetes.default.svc.cluster.local/healthz", type="shoot"} | ||
| OR | ||
| on() vector(0) | ||
| )[4w:5m] | ||
| ) * 100 |
There was a problem hiding this comment.
Where should this be scraped? Should these queries be executed in the context of the shoot cluster, i.e. from within the data plane, or will this be scraped from its control plane? Scraping kube-apiserver from the seed yields different results than from the shoot on most infrastructures due to kube-proxy shortcutting load balancer IPs. In other words, you do not necessarily measure what you expect to measure.
There was a problem hiding this comment.
As part of this extension, we won't scrape individual pods, only other Prometheus servers (i.e., via federation). Hence, the metrics used here are already present. Just above, in the SLI specification, we say that this is at the shoot level. Hence, for this example, this entire block would be in the shoot Prometheus as a recording rule (since the included metrics are already scraped there). Then the prometheus-slo would only federate the recording-rule result from the prometheus-shoot.
There was a problem hiding this comment.
Ok, it was not obvious from the document that no new probes would be created.
The existing two probes above most likely are taken from different points. The first one (type=seed) will likely originate from the control plane while the second one (type=shoot) will originate from the data plane. There are differences in the traffic paths as noted above. Therefore, they are not directly comparable.
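To illustrate what the PromQL expression in this thread computes, here is a small Python sketch of the same arithmetic: at each evaluation step, take the maximum over whichever probe samples exist, fall back to 0 when none exist (the `on() vector(0)` branch), then average over the window and scale to percent. The input layout is an assumption made for illustration.

```python
def availability_percent(steps):
    """Availability over a window.

    `steps` holds one entry per evaluation step; each entry is the list of
    probe_success values (0 or 1) present at that step. An empty list means
    no probe reported at all, which counts as unavailable.
    """
    if not steps:
        raise ValueError("need at least one evaluation step")
    per_step = [max(samples, default=0) for samples in steps]
    return 100.0 * sum(per_step) / len(per_step)


# Hypothetical samples as [seed probe, shoot probe] per step:
# both up, only shoot up, both down, no data.
steps = [[1, 1], [0, 1], [0, 0], []]
result = availability_percent(steps)
```

Note that the `max` counts a step as available as soon as either probe succeeds. Given the point above that the two probes take different traffic paths, a failure seen by only one of them would not show up in this SLI.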
|
|
||
| ### Machine creation latency | ||
|
|
||
| - SLI specification: The number of machines transitioning from `Pending` to `Running` within 20 minutes vs. the total number of nodes `Pending` in the last 20 minutes. If no nodes were pending in the last 20 minutes, the SLO defaults to 100%. This metric is processed at the shoot level. |
There was a problem hiding this comment.
The 20 minutes seem to relate to the default timeout of machine-controller-manager. However, it can be configured individually per shoot cluster. Should this be considered here? For example, some shoot cluster owners may configure it more aggressively on infrastructures where node creation is fast. Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.
There was a problem hiding this comment.
> Should this be considered here? For example, some shoot cluster owners may configure it more aggressively on infrastructures where node creation is fast.
Since these are starter SLOs, we decided to go with the simplest option first. However, we could definitely make this configurable.
> Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.
That is a very good point that we didn't anticipate, and it means we would need to make this configurable. I guess taking the default value from the shoot's configuration is what makes the most sense here. However, should we take the setting from the cluster-autoscaler or from MCM (both have a similar option)? 🤔 I guess we would need the opinion of MCM experts on this.
| - SLO Threshold: default TBD based on real world data, but this would be configurable | ||
| - Notes: | ||
| - We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket` | ||
| - confirm with MCM experts that the `Pending` state only happens during machine creation. |
There was a problem hiding this comment.
A machine is also in Pending when the node-critical components are not ready, yet. From end-user perspective, this is still an unusable node, but it is not strictly related to machine-controller-manager.
There was a problem hiding this comment.
I don't think this is the case. See all the possible states here; I think the machine ends up in either the Unknown or Failed state. This would need to be confirmed, though.
There was a problem hiding this comment.
My experience so far was not that the machines go from Pending through Unknown/Failed to Ready, but feel free to check. What I saw so far was that the machines stay in Pending for a potentially long period of time even if the machine has already joined the kubernetes cluster.
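The machine creation SLI quoted in this thread can be sketched in a few lines of Python, including the "defaults to 100% when nothing was pending" rule and a configurable threshold as discussed above. The input representation (one duration per machine, `None` for machines that never reached `Running`) is an assumption; the proposal intends to derive this from a histogram metric that does not exist yet.

```python
def machine_creation_sli(durations_minutes, threshold_minutes=20.0):
    """Percentage of machines that went from Pending to Running within the threshold.

    `durations_minutes` has one entry per machine that was Pending in the window;
    None marks a machine that did not reach Running. Returns 100.0 when no
    machine was pending, matching the default in the SLI specification.
    """
    if not durations_minutes:
        return 100.0
    fast = sum(
        1 for d in durations_minutes
        if d is not None and d <= threshold_minutes
    )
    return 100.0 * fast / len(durations_minutes)


# Hypothetical window: two fast machines, one slow, one that never became Running.
window = [5.0, 12.0, 25.0, None]
default_sli = machine_creation_sli(window)
relaxed_sli = machine_creation_sli(window, threshold_minutes=30.0)
empty_sli = machine_creation_sli([])
```

The sketch takes the time spent in `Pending` at face value. If, as reported above, machines can stay `Pending` after joining the cluster because node-critical components are not ready, that time would be counted as creation latency too.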
|
|
||
| - SLO Threshold: default TBD based on real world data, but this would be configurable | ||
| - Notes: | ||
| - For now, we won’t take nodes less than 10 minutes old into account (default wait time for nodes to become ready is 20 minutes). |
There was a problem hiding this comment.
Is the discrepancy (10 min vs. 20 min) desired? I understand this metric rather as how available nodes are after they successfully joined the cluster.
WDYT?
There was a problem hiding this comment.
Yes. Since machines are replaced after 20 minutes, if we set the threshold to that same amount, we would barely see the metric failing since the machine gets deleted (sometimes in less than 1 minute). Hence, if we were to set it to 20 minutes or higher, this SLO would never trigger.
|
|
||
| ### Shoot creation latency | ||
|
|
||
| - SLI specification: The amount of shoots getting fully created within 30 minutes vs the amount of shoots getting fully created. |
There was a problem hiding this comment.
Should the 30 minutes value be configurable? For certain infrastructures, the node creation alone may take longer than this.
There was a problem hiding this comment.
Yes, we can probably make that configurable!
timebertt
left a comment
There was a problem hiding this comment.
Thanks for opening this proposal.
My team has already set up basic availability monitoring/reporting for the shoot API servers and integrated it into STACKIT's internal availability monitoring for all other products. So I expect that there is potential for collaborating on this topic in general.
However, I'm skeptical whether this should actually be an extension and whether the suggested probes are technically feasible and product-wise sensible.
Most of my concerns were already mentioned inline by the other reviewers, so I refrained from duplicating them.
| > [!NOTE] | ||
| > We are not aiming for perfection for the initial implementation, but rather to have a good starting point that can be improved over time based on real world data and experience. The goal is rather to have realistic and achievable SLOs that reflect the customer's experience and satisfaction in operating their shoot clusters. Hence, after the initial implementation, we should regularly review, adjust and add SLOs based on the data we collect and the feedback we get from customers and operators. | ||
|
|
||
| ### kube-apiserver general availability |
There was a problem hiding this comment.
I'm wondering how https://github.com/gardener-attic/connectivity-exporter is related to this endeavour and if it could be used as part of it.
There was a problem hiding this comment.
Interesting!!! I didn't even know that this existed!
Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
|
The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
You can:
/lifecycle stale |
|
/remove-lifecycle stale |
|