GEP-0041: SLO monitoring proposal by etiennnr · Pull Request #42 · gardener/enhancements

etiennnr · 2026-02-12T22:48:53Z

One-line PR description: Adding proposal for SLO monitoring previously made in SLO monitoring proposal documentation#818 (adding it here since process changed)

Issue link: Gardener SLO monitoring #41

Other comments: n/a

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

gardener-prow · 2026-02-12T22:48:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rfranzke for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vlerenc · 2026-02-13T15:06:26Z

Thank you @etiennnr. The document lacks some depth on how you would do it.
E.g., how would you scrape "against the network topology" downstream, i.e. access seeds, possibly behind firewalls, from the runtime cluster? Would you build a VPN solution like we have between seeds and shoots? And more questions like this...

rfranzke

cc @istvanballok @chrkl

For me, the main question is why this is proposed as an extension. Currently, all observability-related features live directly in gardener/gardener, even including the metering rules. What justifies moving this out now? What criteria or decision framework should guide whether something observability-related remains in-tree or becomes an extension? I'm having difficulty seeing a clear and consistent direction for monitoring-related functionality.

Configurable SLOs

Do you have examples/use-cases for how/when users would use different values here?

rfranzke · 2026-02-20T16:11:20Z

/kind enhancement

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

etiennnr · 2026-02-20T20:22:54Z

@vlerenc @rfranzke @ScheererJ @timebertt Please note that I added some more details today (sorry for being last minute, but I also didn't receive much feedback until very recently). Please feel free to add comments, which I will try to answer in advance; otherwise we can discuss them in next week's meeting.

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

vlerenc · 2026-02-23T06:32:12Z

+### Scalability
+
+The metrics used to calculate SLOs are scraped by 2 Prometheus instances: the `garden-prometheus` in the runtime cluster and a dedicated `shoot-prometheus` in each shoot cluster. In order to be able to scale across a landscape with a large number of shoots, we need to make sure that the `slo-prometheus` doesn't do too much processing.
+
+Hence, in order to be able to scale horizontally, the SLI processing related to metrics in the shoot clusters will be done in the `shoot-prometheus`. Then the `slo-prometheus` in the runtime cluster is only responsible for aggregating those precomputed SLIs at the shoot level (through the `aggregate-prometheus` in the seed) and calculating the overall SLOs.
+
+For metrics coming from the `garden-prometheus`, since they are not expected to be as numerous as the ones coming from the shoots, we can afford to do the SLI processing directly in the `slo-prometheus`, in order to have the smallest possible impact on the existing monitoring systems.


Indeed, we are now introducing a future bottleneck. So far, we scale infinitely through our seeds, but by collecting data in one place, we create said future bottleneck.

Then again, I understand the urge to have this data in one place and monitor possible service degradations across all infrastructures/seeds in one place (but still with means to slice and dice, to learn of correlations such as, e.g., noticing more issues on a particular cloud provider, but not all).

This one here @etiennnr (mentioned by @timebertt during the steering meeting).
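
To make the two-level aggregation discussed in this thread concrete, here is a minimal Python sketch. It assumes the per-shoot SLIs are already precomputed (as the recording rules in the `shoot-prometheus` would do), so the central step only averages one value per shoot and can still be sliced by a label such as the cloud provider. The `ShootSLI` type, its field names, and the sample values are hypothetical and not part of the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ShootSLI:
    """One precomputed SLI sample per shoot (hypothetical shape)."""
    shoot: str
    provider: str
    availability_pct: float  # assumed to be computed at the shoot level


def aggregate_by_provider(samples: list[ShootSLI]) -> dict[str, float]:
    """Average the precomputed per-shoot SLIs for each provider."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for sample in samples:
        grouped[sample.provider].append(sample.availability_pct)
    return {provider: sum(v) / len(v) for provider, v in grouped.items()}


# Illustrative data only.
samples = [
    ShootSLI("shoot-a", "aws", 100.0),
    ShootSLI("shoot-b", "aws", 99.0),
    ShootSLI("shoot-c", "gcp", 95.0),
]
by_provider = aggregate_by_provider(samples)
```

In this sketch the central step handles one value per shoot and SLI, so its input grows with the number of shoots rather than with the number of raw series behind each SLI. It still grows, which is the bottleneck concern raised above.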

vlerenc · 2026-02-23T06:49:17Z

+- Notes:
+  - In addition to what's above, we should also add another probe from the shoot to the external endpoint of the kube-apiserver. This way, it should effectively filter out problems from the internet / external to us and focus on problems between the seed's load balancer and the kube-apiserver.
+
+### kube-apiserver latency


Latency, even more than availability, may be negatively impacted by user workload/controllers.

How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?

SLO metrics are mainly there to reflect (as much as possible) how healthy our systems are from the customer perspective (in other words, how happy they are, but purely from a reliability standpoint). They are not meant for troubleshooting; that's what the other metrics and dashboards are for. In our case, since both Gardener operators and the customer have a role to play with latency, it's not going to be a perfect metric.

As said at the top, we are not aiming for perfection in the initial SLOs since there are still a lot of unknowns about what's going on in the real world... However, even if the high latency is the customer's fault, I believe it's a good thing that we know so we can let them know.

vlerenc · 2026-02-23T06:49:32Z

+- Notes:
+  - The `shoot:apiserver_latency:percentage` is a metric created from a Prometheus rule, defined [here](https://github.com/gardener/gardener/blob/71650f56a7bfa555bfc5a09a1b1f97439a4b3d40/pkg/component/kubernetes/apiserver/prometheusrule.go#L181-L184).
+
+### kube-apiserver Error rate


Error rates, even more than availability, may be negatively impacted by user workload/controllers.

How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?

Same answer as #42 (comment). Not perfect, but good enough to have an overview of the cluster's health.

vlerenc · 2026-02-23T06:51:50Z

+Notes:
+  - N/A
+
+### Machine creation latency


Highly dependent on the cloud provider.

Indeed. However, even then, I guess that any cloud provider should normally provision within that 10-minute delay? That being said, that "normal delay" would of course need to be configurable. This metric is already "per provider" since it's calculated per shoot.

vlerenc · 2026-02-23T06:52:09Z

+  - We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket`
+  - confirm with MCM experts that the `Pending` state only happens during machine creation.
+
+### Node general availability


Highly dependent on the user workload.

Again, same answer as #42 (comment)

vlerenc · 2026-02-23T06:55:04Z

+- **Reusing existing Prometheus instances**: Due to the high computational overhead of calculating SLOs, we didn't think it is a good idea to reuse the existing garden-prometheus, since this could potentially have an impact on other metrics. Also, since we need to keep the metrics for a longer period to calculate SLOs (more than the time window chosen for the SLOs), reusing the existing instances with extended retention could lead to resource contention. Hence, isolating the SLO ensures stability and separation of concerns.
+- **Implementing SLOs as part of Gardener core**: This functionality also adds computational overhead, both for the shoot Prometheus and the aggregate Prometheus, so we believe this should be opt-in for Gardener operators that want to use it.


Can you please share some data on how much we handle in the garden-prometheus? To me it would be more natural to use the same instance and not introduce a new one for every different task (resulting in a different endpoint as well). Also, the garden one, so I thought, isn't doing terribly much today, is it?

I'm not an expert on how heavily this Prometheus is used, but this is actually something that was proposed/reviewed by monitoring colleagues. The garden Prometheus already federates metrics from seeds and from the garden-apiserver for every single shoot in its landscape, so I'd guess it is quite busy already.

But I'd say that the data retention period is probably the main concern.

ScheererJ

Thanks for creating the proposal.

ScheererJ · 2026-02-23T09:05:00Z

+- **SLO-based alerting**: Since we have the data to calculate SLO violations and burn rates (SRE best practice), we should also provide, as part of the extension, the capability to configure an Alertmanager based on those SLOs. Again, this should be configurable to fit the needs of each Gardener operator.
+- **Monitoring infrastructure**: The extension should provide the necessary monitoring infrastructure to collect, store, and visualize SLO-related metrics. This includes Prometheus rules for SLI calculation, Perses dashboards for visualization, Prometheus alerts for SLO violations, Alertmanager to manage those alerts, etc.
+
+The extension builds on the existing monitoring infrastructure (Prometheus operator, Perses operators, plutono annotations, ...), using a dedicated Prometheus instance in the runtime cluster to collect and aggregate SLO-specific metrics with minimal impact on the existing monitoring systems.


I am a bit confused as you mention perses and plutono. Does this make any difference? Do we rely on plutono features? Would the eventual migration to perses cause effort?

I synced with @rickardsjp and he told me that Perses should essentially be a drop-in replacement for plutono (apart from a few minor caveats).
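
On the SLO-based alerting bullet quoted at the top of this thread: a burn rate is commonly defined as the observed error ratio divided by the error budget (1 minus the SLO target). The following is a minimal Python sketch of that arithmetic only. The function name and the sample numbers are illustrative assumptions, not something the proposal specifies.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    error_ratio: fraction of bad events in the window (0.0 to 1.0).
    slo_target:  target fraction of good events, e.g. 0.999.
    A value of 1.0 means the budget lasts exactly until the end of the
    SLO window; higher values exhaust it sooner.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    budget = 1.0 - slo_target
    return error_ratio / budget


# 0.5% errors against a 99% target consumes the budget at about half
# the sustainable rate (illustrative numbers).
rate = burn_rate(error_ratio=0.005, slo_target=0.99)
```

An alert rule would then compare such a value against a threshold chosen by the operator, which is where the configurability mentioned in the quoted bullet would come in.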

ScheererJ · 2026-02-23T09:28:40Z

+- SLI implementation:
+
+  ```promql
+  avg_over_time(
+    max(
+      probe_success{instance="https://api.internal_domain/healthz", type="seed"}
+      OR
+      probe_success{instance="https://kubernetes.default.svc.cluster.local/healthz", type="shoot"}
+      OR 
+      on() vector(0)
+    )[4w:5m]
+  ) * 100


Where should this be scraped? Should these queries be executed in the context of the shoot cluster, i.e. from within the data plane, or will this be scraped from its control plane? Scraping kube-apiserver from the seed yields different results than from the shoot on most infrastructures due to kube-proxy shortcutting loadbalancer IPs. In other words, you do not necessarily measure what you expect to measure.

As part of this extension, we won't scrape individual pods, only other Prometheus servers (i.e., via federation). Hence, the metrics used here are already present. Just above, in the SLI specification, we say that this is at the shoot level. Hence, for this example, this entire block would be in the shoot Prometheus as a recording rule (since the included metrics are already scraped there). Then the prometheus-slo would only federate the recording rule result from the prometheus-shoot.

Ok, it was not obvious from the document that no new probes would be created.
The existing two probes above most likely are taken from different points. The first one (type=seed) will likely originate from the control plane while the second one (type=shoot) will originate from the data plane. There are differences in the traffic paths as noted above. Therefore, they are not directly comparable.
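
To make explicit what the quoted expression computes: at each 5-minute step it takes the maximum over whichever probe samples exist (falling back to 0 through `vector(0)` when neither probe returned a sample), then averages those per-step values over the window and scales to percent. Below is a minimal Python sketch of that arithmetic. The names and sample data are hypothetical, and it models one series per probe for simplicity.

```python
from typing import Optional, Sequence, Tuple

# One evaluation step: (seed probe, shoot probe); None means no sample.
Step = Tuple[Optional[float], Optional[float]]


def availability_pct(steps: Sequence[Step]) -> float:
    """Mimic avg_over_time(max(a OR b OR on() vector(0))[window:step]) * 100."""
    if not steps:
        raise ValueError("need at least one evaluation step")
    per_step = []
    for seed, shoot in steps:
        present = [v for v in (seed, shoot) if v is not None]
        # No probe sample at all counts as down (the vector(0) fallback).
        per_step.append(max(present) if present else 0.0)
    return sum(per_step) / len(per_step) * 100


# Illustrative steps: both up, only shoot probe up, no samples, seed probe down.
example = availability_pct([(1.0, 1.0), (0.0, 1.0), (None, None), (0.0, None)])
```

Because the maximum is taken per step, a failing probe from one vantage point is masked whenever the probe from the other vantage point succeeds. This is relevant to the point above that the two probes take different traffic paths.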

ScheererJ · 2026-02-23T11:42:39Z

+
+### Machine creation latency
+
+- SLI specification: The number of machines transitioning from `Pending` to `Running` within 20 minutes vs. the total number of nodes `Pending` in the last 20 minutes. If no nodes were pending in the last 20 minutes, the SLO defaults to 100%. This metric is processed at the shoot level.


The 20 minutes seem to relate to the default timeout of machine-controller-manager. However, it can be configured individually per shoot cluster. Should this be considered here? For example, some shoot cluster owner may configure it more aggressively on infrastructures where node creation is fast. Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.

> Should this be considered here? For example, some shoot cluster owner may configure it more aggressively on infrastructures where node creation is fast.

Since these are starter SLOs, we decided to go with the simplest option first. However, we could definitely make this configurable.

> Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.

That is a very good point that we hadn't anticipated, meaning we would need to make this configurable. I guess taking the default value from the shoot's configuration is what makes the most sense here. However, should we take the setting from the cluster-autoscaler or from MCM (both have a similar option)? 🤔 I guess we would need the opinion of MCM experts on this.
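
The SLI specification quoted at the top of this thread, with the timeout made configurable as discussed, can be sketched in Python as follows. This is only an illustration of the ratio and its "default to 100%" rule. The function name and the input shape (one creation duration per machine, `None` for machines that never reached `Running`) are assumptions, and a real implementation would presumably derive this from the histogram metric that does not exist yet.

```python
from typing import Optional, Sequence


def machine_creation_sli(
    durations_min: Sequence[Optional[float]],
    timeout_min: float = 20.0,
) -> float:
    """Percentage of pending machines that reached Running in time.

    durations_min: minutes from Pending to Running for each machine seen
        in the window; None if the machine never reached Running.
    timeout_min: configurable threshold (20 minutes in the proposal text).
    """
    if not durations_min:
        # Nothing was pending in the window: the SLI defaults to 100%.
        return 100.0
    on_time = sum(
        1 for d in durations_min if d is not None and d <= timeout_min
    )
    return on_time / len(durations_min) * 100


# Illustrative data: two fast machines, one slow, one that never came up.
observed = [5.0, 25.0, None, 10.0]
default_sli = machine_creation_sli(observed)
relaxed_sli = machine_creation_sli(observed, timeout_min=30.0)
```

With the same observations, a more relaxed timeout (as might be needed for bare metal) moves the slow machine into the "on time" group, which is why the threshold choice matters per shoot.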

ScheererJ · 2026-02-23T11:44:38Z

+- SLO Threshold: default TBD based on real world data, but this would be configurable
+- Notes:
+  - We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket`
+  - confirm with MCM experts that the `Pending` state only happens during machine creation.


A machine is also in Pending when the node-critical components are not yet ready. From an end-user perspective, this is still an unusable node, but it is not strictly related to machine-controller-manager.

I don't think this is the case. See all the possible states here; I think the machine ends up in either the Unknown or the Failed state. This would need to be confirmed, though.

My experience so far was not that the machines go from Pending through Unknown/Failed to Ready, but feel free to check. What I saw so far was that the machines stay in Pending for a potentially long period of time even if the machine has already joined the Kubernetes cluster.

ScheererJ · 2026-02-23T11:47:11Z

+
+- SLO Threshold: default TBD based on real world data, but this would be configurable
+- Notes:
+  - For now, we won’t take nodes less than 10 minutes old into account (default wait time for nodes to become ready is 20 minutes).


Is the discrepancy (10 min vs. 20 min) desired? I understand this metric rather as measuring how available nodes are after they have successfully joined the cluster.
WDYT?

Yes. Since machines are replaced after 20 minutes, if we set the threshold to that same amount, we would barely see the metric failing since the machine gets deleted (sometimes in less than 1 minute). Hence, if we were to use 20 minutes or higher, this SLO would never trigger.
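
The reasoning above can be illustrated with a small Python sketch: nodes younger than a grace period are ignored, and only the remaining nodes count towards the availability ratio. The `Node` type, the sample data, and the choice to return 100% when no node is old enough are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical snapshot of a node at evaluation time."""
    name: str
    age_min: float
    ready: bool


def node_availability_pct(nodes: Sequence[Node], min_age_min: float = 10.0) -> float:
    """Percentage of ready nodes among nodes at least min_age_min old."""
    eligible = [n for n in nodes if n.age_min >= min_age_min]
    if not eligible:
        # Assumption: with no eligible node there is nothing to violate.
        return 100.0
    ready = sum(1 for n in eligible if n.ready)
    return ready / len(eligible) * 100


# Illustrative data: one node stuck for 15 minutes, three healthy nodes.
nodes = [
    Node("stuck", 15.0, False),
    Node("n1", 30.0, True),
    Node("n2", 100.0, True),
    Node("n3", 200.0, True),
]
with_10_min = node_availability_pct(nodes, min_age_min=10.0)
with_20_min = node_availability_pct(nodes, min_age_min=20.0)
```

In this example the stuck node lowers the result with a 10-minute grace period but is excluded with a 20-minute one. Since it would be replaced at around 20 minutes anyway, it would never become visible with the larger value.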

ScheererJ · 2026-02-23T11:50:48Z

+
+### Shoot creation latency
+
+- SLI specification: The amount of shoots getting fully created within 30 minutes vs the amount of shoots getting fully created.


Should the 30 minutes value be configurable? For certain infrastructures, the node creation alone may take longer than this.

Yes, we can probably make that configurable!

timebertt

Thanks for opening this proposal.
My team has already set up basic availability monitoring/reporting for the shoot API servers and integrated it into STACKIT's internal availability monitoring for all other products. So I expect that there is potential for collaborating on this topic in general.
However, I'm skeptical whether this should actually be an extension and whether the suggested probes are technically feasible and product-wise sensible.
Most of my concerns were already mentioned inline by the other reviewers, so I refrained from duplicating them.

timebertt · 2026-02-23T15:06:22Z

+> [!NOTE]
+> We are not aiming for perfection for the initial implementation, but rather to have a good starting point that can be improved over time based on real world data and experience. The goal is rather to have realistic and achievable SLOs that reflect the customer's experience and satisfaction in operating their shoot clusters. Hence, after the initial implementation, we should regularly review, adjust and add SLOs based on the data we collect and the feedback we get from customers and operators.
+
+### kube-apiserver general availability


I'm wondering how https://github.com/gardener-attic/connectivity-exporter is related to this endeavour and if it could be used as part of it.

Interesting!!! I didn't even know that this existed!

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

gardener-ci-robot · 2026-03-26T14:58:30Z

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as active with /lifecycle active
Mark this PR as fresh with /remove-lifecycle stale
Mark this PR as rotten with /lifecycle rotten
Close this PR with /close

/lifecycle stale

ScheererJ · 2026-04-08T07:44:43Z

/remove-lifecycle stale

gardener-ci-robot · 2026-05-08T08:18:29Z

/lifecycle stale

Adding proposal

0cd5012

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

gardener-prow Bot added the do-not-merge/needs-kind (indicates a PR lacks a `kind/foo` label and requires one) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Feb 12, 2026

This was referenced Feb 12, 2026

SLO monitoring proposal gardener/documentation#818

Closed

Gardener SLO monitoring #41

Open

rfranzke reviewed Feb 20, 2026

Comment thread geps/0041-slo-monitoring/gep.yaml Outdated

gardener-prow Bot added the kind/enhancement (enhancement, improvement, extension) and cla: yes (indicates the PR's author has signed the cla-assistant.io CLA) labels and removed the do-not-merge/needs-kind label Feb 20, 2026

etiennnr and others added 5 commits February 20, 2026 12:17

adding constraint

726d7e7

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

adding initial SLOs

d029e6b

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

adding link to monitoring issue

68794fe

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

Update geps/0041-slo-monitoring/gep.yaml

ea4a78f

Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>

adding more details

735ff42

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

adding details

8d85bcd

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

vlerenc reviewed Feb 23, 2026

ScheererJ reviewed Feb 23, 2026

timebertt reviewed Feb 23, 2026

etiennnr and others added 8 commits February 23, 2026 13:58

Update geps/0041-slo-monitoring/README.md

dca5683

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

svg + prometheus name alignment

9a72dd8

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

typo

0d34808

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

Adding clarity / consistency based on feedback

eccbb7f

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

Adding some context/link to motivation

2ceaf95

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

new svg + table of content

d21d34d

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

right title identation

3d7318c

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

svg update

23f6483

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>

gardener-prow Bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Mar 26, 2026

gardener-prow Bot removed the lifecycle/stale label Apr 8, 2026

gardener-prow Bot added the lifecycle/stale label May 8, 2026
