GEP-0041: SLO monitoring proposal #42

Open
etiennnr wants to merge 15 commits into gardener:main from etiennnr:main

@etiennnr
Member

@etiennnr etiennnr commented Feb 12, 2026

  • Other comments: n/a

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
@gardener-prow

gardener-prow Bot commented Feb 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rfranzke for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2026
@vlerenc
Member

vlerenc commented Feb 13, 2026

Thank you @etiennnr. The document lacks depth on how you would do it.
E.g., how would you scrape "against the network topology" downstream, i.e. access seeds, possibly behind firewalls, from the runtime cluster? Would you build a VPN solution like we have between seeds and shoots? And more questions like this...

Member

@rfranzke rfranzke left a comment

cc @istvanballok @chrkl

For me, the main question is why this is proposed as an extension. Currently, all observability-related features live directly in gardener/gardener, even including the metering rules. What justifies moving this out now? What criteria or decision framework should guide whether something observability-related remains in-tree or becomes an extension? I'm having difficulty seeing a clear and consistent direction for monitoring-related functionality.

Configurable SLOs

Do you have examples/use-cases for how/when users would use different values here?

Comment thread geps/0041-slo-monitoring/gep.yaml Outdated
@rfranzke
Member

/kind enhancement

@gardener-prow gardener-prow Bot added kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Feb 20, 2026
etiennnr and others added 5 commits February 20, 2026 12:17
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
@etiennnr
Member Author

@vlerenc @rfranzke @ScheererJ @timebertt Please note that I added some more details today (sorry for being last minute, but I also didn't receive much feedback until very recently). Please feel free to add comments, which I will try to answer in advance; otherwise we can discuss them in next week's meeting.

Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment thread geps/0041-slo-monitoring/README.md
Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment on lines +102 to +108
### Scalability

The metrics used to calculate SLOs are scraped by two Prometheus instances: the `garden-prometheus` in the runtime cluster and a dedicated `shoot-prometheus` in each shoot cluster. In order to be able to scale across a landscape with a large number of shoots, we need to make sure that the `slo-prometheus` doesn't do too much processing.

Hence, in order to be able to scale horizontally, the SLI processing related to metrics in the shoot clusters will be done in the `shoot-prometheus`. Then the `slo-prometheus` in the runtime cluster is only responsible for aggregating those precomputed SLIs at the shoot level (through the `aggregate-prometheus` in the seed) and calculating the overall SLOs.

For metrics coming from the `garden-prometheus`, since they are not expected to be as numerous as the ones coming from the shoots, we can afford to do the SLI processing directly in the `slo-prometheus`, in order to have the smallest possible impact on the existing monitoring systems.
Member

Indeed, we are now introducing a future bottleneck. So far, we scale infinitely through our seeds, but by collecting data in one place, we create said future bottleneck.

Then again, I understand the urge to have this data in one place and monitor possible service degradations across all infrastructures/seeds in one place (but still with means to slice and dice, to learn of correlations such as, e.g., noticing more issues on a particular cloud provider, but not all).

Member

This one here @etiennnr (mentioned by @timebertt during the steering meeting).
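
The two-level processing described in the Scalability excerpt above (per-shoot SLIs precomputed close to the data, then aggregated centrally, with the option to slice by cloud provider) can be illustrated with a minimal sketch. The `ShootSLI` type, the `provider` label, and the unweighted mean are assumptions made for illustration only; the proposal does not prescribe an aggregation function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShootSLI:
    """One precomputed SLI value, as a shoot-level recording rule might expose it."""
    shoot: str
    provider: str        # hypothetical label, used only for slicing
    availability: float  # percentage in [0, 100]


def aggregate(slis: list[ShootSLI]) -> float:
    """Landscape-wide SLI: unweighted mean of the per-shoot values."""
    if not slis:
        return 100.0  # assumption: no shoots means nothing is violated
    return sum(s.availability for s in slis) / len(slis)


def by_provider(slis: list[ShootSLI]) -> dict[str, float]:
    """Slice the same precomputed values per provider to spot correlations."""
    groups: dict[str, list[ShootSLI]] = {}
    for s in slis:
        groups.setdefault(s.provider, []).append(s)
    return {provider: aggregate(group) for provider, group in groups.items()}
```

The central step only touches one value per shoot, which is why its cost grows with the number of shoots rather than with the number of raw samples.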

Comment thread geps/0041-slo-monitoring/README.md Outdated
- Notes:
- In addition to what's above, we should also add another probe from the shoot to the external endpoint of the kube-apiserver. This way, it should effectively filter out problems from the internet / external to us and focus on problems between the seed's load balancer and the kube-apiserver.

### kube-apiserver latency
Member

Latency, even more than availability, may be negatively impacted by user workload/controllers.

How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?

Member Author

SLO metrics are mainly there to reflect (as much as possible) how healthy our systems are from the customer perspective (in other words, how happy they are, but purely from a reliability standpoint). They are not meant for troubleshooting; that's what the other metrics and dashboards are for. In our case, since both Gardener operators and the customer have a role to play with latency, it's not going to be a perfect metric.

As said at the top, we are not aiming for perfection in the initial SLOs since there are still a lot of unknowns about what's going on in the real world. However, even if the high latency is the customer's fault, I believe it's a good thing that we know so we can let them know.

Comment thread geps/0041-slo-monitoring/README.md Outdated
- Notes:
- The `shoot:apiserver_latency:percentage` is a metric created from a Prometheus rule, defined [here](https://github.com/gardener/gardener/blob/71650f56a7bfa555bfc5a09a1b1f97439a4b3d40/pkg/component/kubernetes/apiserver/prometheusrule.go#L181-L184).

### kube-apiserver Error rate
Member

Error rates, even more than availability, may be negatively impacted by user workload/controllers.

How would a Gardener operator quickly identify whether it's really a problem with Gardener/the infrastructure or stems from user workload/controllers?

Member Author

Same answer as #42 (comment). Not perfect, but good enough to have an overview of the cluster's health.

Comment thread geps/0041-slo-monitoring/README.md Outdated
Notes:
- N/A

### Machine creation latency
Member

Highly dependent on the cloud provider.

Member Author

@etiennnr etiennnr Feb 23, 2026

Indeed. However, even then, I guess that any cloud provider should normally provision within that 10-minute delay? That being said, that "normal delay" would need to be configurable, of course. This metric is already "per provider" since it's calculated per shoot.

Comment thread geps/0041-slo-monitoring/README.md Outdated
- We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket`
- Confirm with MCM experts that the `Pending` state only happens during machine creation.

### Node general availability
Member

Highly dependent on the user workload.

Member Author

Again, same answer as #42 (comment)

Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment on lines +351 to +352
- **Reusing existing Prometheus instances**: Due to the high computational overhead of calculating SLOs, we didn't think it was a good idea to reuse the existing garden-prometheus, since this could potentially have an impact on other metrics. Also, since we need to keep the metrics for a longer period to calculate SLOs (more than the time window chosen for the SLOs), reusing the existing instances with extended retention could lead to resource contention. Hence, isolating the SLO processing ensures stability and separation of concerns.
- **Implementing SLOs as part of Gardener core**: This functionality also adds computational overhead, both for the shoot Prometheus and the aggregate Prometheus, so we believe this should be opt-in for Gardener operators that want to use it.
Member

Can you please share some data on how much we handle in the garden-prometheus? To me, it would be more natural to use the same instance and not introduce a new one for every different task (resulting in a different endpoint as well). Also, the garden one, so I thought, isn't doing terribly much today, is it?

Member Author

@etiennnr etiennnr Feb 23, 2026

I'm not an expert on how much this Prometheus is getting used, but this is actually something that was proposed/reviewed by monitoring colleagues. The garden Prometheus already federates metrics from seeds and from the garden-apiserver for every single shoot in its landscape, so I'd guess it is quite busy already.

But, I'd say that data retention period is probably the main concern.

Member

@ScheererJ ScheererJ left a comment

Thanks for creating the proposal.

Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment thread geps/0041-slo-monitoring/slo-extension-plan.png Outdated
- **SLO-based alerting**: Since we have the data to calculate SLO violations and burn rates (SRE best practice), we should also provide, as part of the extension, the capability to configure an Alertmanager based on those SLOs. Again, this should be configurable to fit the needs of each Gardener operator.
- **Monitoring infrastructure**: The extension should provide the necessary monitoring infrastructure to collect, store, and visualize SLO-related metrics. This includes Prometheus rules for SLI calculation, Perses dashboards for visualization, Prometheus alerts for SLO violations, Alertmanager to manage those alerts, etc.

The extension builds on the existing monitoring infrastructure (Prometheus operator, Perses operators, plutono annotations, ...), using a dedicated Prometheus instance in the runtime cluster to collect and aggregate SLO-specific metrics with minimal impact on the existing monitoring systems.
Member

I am a bit confused as you mention both Perses and Plutono. Does this make any difference? Do we rely on Plutono features? Would the eventual migration to Perses cause effort?

Member Author

I synced with @rickardsjp and he told me that Perses should essentially be a drop-in replacement for Plutono (apart from a few minor caveats).
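
For readers unfamiliar with the burn rates mentioned in the SLO-based alerting bullet quoted above, here is a minimal sketch of the usual definition: the observed error ratio divided by the error budget. The function names and the two-window check are illustrative assumptions, not part of the proposal.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is consumed relative to the allowed pace.

    error_ratio: observed fraction of bad events in the window (0..1)
    slo_target:  objective as a fraction, e.g. 0.999
    A value of 1.0 means the budget lasts exactly the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(short_window: float, long_window: float, threshold: float) -> bool:
    """Multi-window check: fire only if both windows burn above the threshold."""
    return short_window >= threshold and long_window >= threshold
```

Requiring both a short and a long window to exceed the threshold is a common way to avoid alerting on brief spikes while still resetting quickly once the problem is gone.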

Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment on lines +165 to +176
- SLI implementation:

```promql
avg_over_time(
  max(
    probe_success{instance="https://api.internal_domain/healthz", type="seed"}
    OR
    probe_success{instance="https://kubernetes.default.svc.cluster.local/healthz", type="shoot"}
    OR
    on() vector(0)
  )[4w:5m]
) * 100
```
Member

Where should this be scraped? Should these queries be executed in the context of the shoot cluster, i.e. from within the data plane, or will this be scraped from its control plane? Scraping kube-apiserver from the seed yields different results than scraping from the shoot on most infrastructures due to kube-proxy shortcutting load balancer IPs. In other words, you do not necessarily measure what you expect to measure.

Member Author

As part of this extension, we won't scrape individual pods, only other Prometheus servers (i.e., via federation). Hence, the metrics used here are already present. Just above, in the SLI specification, we say that this is at the shoot level. Hence, for this example, this entire block would be in the shoot Prometheus as a recording rule (since the included metrics are already scraped there). Then the prometheus-slo would only federate the recording-rule result from the prometheus-shoot.

Member

Ok, it was not obvious from the document that no new probes would be created.
The existing two probes above most likely are taken from different points. The first one (type=seed) will likely originate from the control plane while the second one (type=shoot) will originate from the data plane. There are differences in the traffic paths as noted above. Therefore, they are not directly comparable.
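
To make the behaviour of the SLI query discussed in this thread concrete, the following is a rough sketch of what `avg_over_time(max(a OR b OR on() vector(0))[4w:5m]) * 100` computes, replayed over a handful of evaluation steps. The input format (one tuple of seed-side and shoot-side probe results per step) is an assumption made for illustration.

```python
from typing import Optional


def availability_sli(samples: list[tuple[Optional[int], Optional[int]]]) -> float:
    """Percentage of evaluation steps in which at least one probe succeeded.

    Each sample is (seed_probe, shoot_probe) for one step; a value is
    1 (success), 0 (failure) or None (series absent at that step).
    """
    if not samples:
        raise ValueError("no samples in the window")
    per_step = []
    for seed, shoot in samples:
        present = [v for v in (seed, shoot) if v is not None]
        # Mirrors `max(a OR b OR on() vector(0))`: absent series count as 0.
        per_step.append(max(present) if present else 0)
    return sum(per_step) / len(per_step) * 100
```

Because of the `max`, a step where only one of the two vantage points fails still counts as available, so a failure visible from just the control plane or just the data plane is masked. This is relevant to the concern that the two probes follow different traffic paths.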

Comment thread geps/0041-slo-monitoring/README.md Outdated
Comment thread geps/0041-slo-monitoring/README.md Outdated

### Machine creation latency

- SLI specification: The number of machines transitioning from `Pending` to `Running` within 20 minutes vs. the total number of nodes `Pending` in the last 20 minutes. If no nodes were pending in the last 20 minutes, the SLO defaults to 100%. This metric is processed at the shoot level.
Member

The 20 minutes seem to relate to the default timeout of machine-controller-manager. However, it can be configured individually per shoot cluster. Should this be considered here? For example, some shoot cluster owners may configure it more aggressively on infrastructures where node creation is fast. Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.

Member Author

> Should this be considered here? For example, some shoot cluster owner may configure it more aggressively on infrastructures where node creation is fast.

Since these are starter SLOs, we decided to go with the simplest option first. However, we could definitely make this configurable.

> Then again, it may be necessary to increase this for bare metal machines, which take a lot longer to boot due to main memory checks.

However, that is a very good point that we didn't expect. It means we would need to make this configurable. I guess taking the default value from the shoot's configuration is what makes the most sense here. However, should we take the setting from the cluster-autoscaler or from MCM (both have a similar option)? 🤔 I guess we would need the opinion of MCM experts on this.
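
The SLI specification under discussion, including the 100% default and a configurable threshold, can be sketched as follows. The function name and the input format (minutes to reach `Running`, or `None` if the machine never got there) are hypothetical.

```python
from typing import Optional


def creation_sli(durations_min: list[Optional[float]], threshold_min: float = 20.0) -> float:
    """Percentage of pending machines that reached Running within the threshold.

    durations_min holds, per machine that was Pending in the window, the
    minutes it took to reach Running, or None if it never did.
    """
    if not durations_min:
        return 100.0  # nothing was pending: the value defaults to 100%
    good = sum(1 for d in durations_min if d is not None and d <= threshold_min)
    return good / len(durations_min) * 100
```

For example, `creation_sli([5.0, 25.0, None, 10.0])` yields 50.0 with the default threshold and 75.0 with `threshold_min=30.0`, which shows how strongly the result depends on the per-shoot timeout raised in this thread.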

- SLO Threshold: default TBD based on real world data, but this would be configurable
- Notes:
- We need to implement a histogram metric that doesn't exist at the moment: `mcm_machine_creation_duration_minutes_bucket`
- Confirm with MCM experts that the `Pending` state only happens during machine creation.
Member

A machine is also in Pending when the node-critical components are not ready yet. From an end-user perspective, this is still an unusable node, but it is not strictly related to machine-controller-manager.

Member Author

@etiennnr etiennnr Feb 23, 2026

I don't think this is the case. See all the possible states here; I think the machine ends up in either the Unknown or the Failed state. This would need to be confirmed, though.

Member

My experience so far was not that the machines go from Pending through Unknown/Failed to Ready, but feel free to check. What I saw so far was that the machines stay in Pending for a potentially long period of time even if the machine has already joined the kubernetes cluster.
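
Regarding the proposed (not yet existing) histogram `mcm_machine_creation_duration_minutes_bucket` noted above: the following sketch shows how a "created within N minutes" ratio is read from cumulative histogram buckets. The bucket boundaries are made up for illustration, and the function is a hypothetical helper rather than anything provided by MCM or Prometheus.

```python
def fraction_within(buckets: dict[float, int], threshold: float) -> float:
    """Share of observations at or below `threshold`, from cumulative buckets.

    `buckets` maps each upper bound (`le`) to its cumulative count and must
    include float("inf") as the total, as Prometheus histograms do.
    """
    total = buckets[float("inf")]
    if total == 0:
        return 1.0  # assumption: no machine creations counts as fully compliant
    if threshold not in buckets:
        raise KeyError(f"no bucket with le={threshold}")
    return buckets[threshold] / total
```

One consequence of this design is that the threshold has to coincide with a bucket boundary. If the timeout becomes configurable per shoot, the bucket layout would need to cover the values operators are expected to choose.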


- SLO Threshold: default TBD based on real world data, but this would be configurable
- Notes:
- For now, we won’t take nodes less than 10 minutes old into account (default wait time for nodes to become ready is 20 minutes).
Member

Is the discrepancy (10 min vs. 20 min) desired? I understand this metric rather as measuring how available nodes are after they have successfully joined the cluster.
WDYT?

Member Author

Yes. Since machines are replaced after 20 minutes, if we set the threshold to that same amount, we would barely see the metric failing since the machine gets deleted (sometimes in less than 1 minute). Hence, if we were to use 20 minutes or higher, this SLO would never trigger.
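
The effect of the age cutoff discussed in this thread can be sketched as follows. The `Node` type and its fields are hypothetical, and treating an empty set of eligible nodes as 100% is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    age_minutes: float
    ready: bool


def node_availability(nodes: list[Node], min_age_minutes: float = 10.0) -> float:
    """Percentage of Ready nodes among those at least `min_age_minutes` old."""
    eligible = [n for n in nodes if n.age_minutes >= min_age_minutes]
    if not eligible:
        return 100.0  # assumption: nothing to measure counts as healthy
    return sum(n.ready for n in eligible) / len(eligible) * 100
```

With a 10-minute cutoff, a node that is 15 minutes old and still not ready lowers the result. With a 20-minute cutoff, the same node is excluded, and in practice it would already have been replaced, which is the reason given above for keeping the cutoff below the replacement timeout.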


### Shoot creation latency

- SLI specification: The number of shoots fully created within 30 minutes vs. the total number of shoots fully created.
Member

Should the 30 minutes value be configurable? For certain infrastructures, the node creation alone may take longer than this.

Member Author

Yes, we can probably make that configurable!

Member

@timebertt timebertt left a comment

Thanks for opening this proposal.
My team has already set up basic availability monitoring/reporting for the shoot API servers and integrated it into STACKIT's internal availability monitoring for all other products. So I expect that there is potential for collaborating on this topic in general.
However, I'm skeptical whether this should actually be an extension and whether the suggested probes are technically feasible and product-wise sensible.
Most of my concerns were already mentioned inline by the other reviewers, so I refrained from duplicating them.

Comment thread geps/0041-slo-monitoring/README.md
Comment thread geps/0041-slo-monitoring/README.md Outdated
> [!NOTE]
> We are not aiming for perfection for the initial implementation, but rather to have a good starting point that can be improved over time based on real world data and experience. The goal is rather to have realistic and achievable SLOs that reflect the customer's experience and satisfaction in operating their shoot clusters. Hence, after the initial implementation, we should regularly review, adjust and add SLOs based on the data we collect and the feedback we get from customers and operators.

### kube-apiserver general availability
Member

I'm wondering how https://github.com/gardener-attic/connectivity-exporter is related to this endeavour and if it could be used as part of it.

Member Author

Interesting!!! I didn't even know that this existed!

etiennnr and others added 8 commits February 23, 2026 13:58
Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
Signed-off-by: Etienne Kemp-Rousseau <etienne.kr@hotmail.com>
@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2026
@ScheererJ
Member

/remove-lifecycle stale

@gardener-prow gardener-prow Bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2026
@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2026
Labels

cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

6 participants