23 changes: 23 additions & 0 deletions hosted_control_planes/hcp-observability.adoc
@@ -9,10 +9,33 @@ toc::[]
[role="_abstract"]
You can gather metrics for {hcp} by configuring metrics sets. Monitoring dashboards are created in the management cluster for each hosted cluster that it manages.

// Metrics sets
include::modules/hosted-control-planes-metrics-sets.adoc[leveloffset=+1]

include::modules/hosted-control-planes-monitoring-dashboard.adoc[leveloffset=+1]

// Monitoring CP metrics from the hosted cluster
include::modules/hcp-cp-metrics-overview.adoc[leveloffset=+1]

include::modules/hcp-cp-metrics-enable.adoc[leveloffset=+2]

include::modules/hcp-cp-query-metrics.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../operators/understanding/olm/olm-understanding-metrics.adoc#olm-metrics_olm-understanding-metrics[Exposed metrics]

include::modules/hcp-cp-query-metrics-console.adoc[leveloffset=+2]

[role="_additional-resources"]
.Additional resources

* xref:../operators/understanding/olm/olm-understanding-metrics.adoc#olm-metrics_olm-understanding-metrics[Exposed metrics]

include::modules/hcp-cp-metrics-dashboards.adoc[leveloffset=+2]

//Connectivity metrics
include::modules/hcp-connectivity-metrics.adoc[leveloffset=+1]

include::modules/hcp-connect-data-plane.adoc[leveloffset=+2]
119 changes: 119 additions & 0 deletions modules/hcp-cp-metrics-dashboards.adoc
@@ -0,0 +1,119 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-observability.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-cp-metrics-dashboards_{context}"]
= Importing control plane health dashboards

[role="_abstract"]
You can import a sample Grafana dashboard that visualizes propagated control plane metrics in the hosted cluster web console. The dashboard includes health panels for the API server, etcd, cluster Operators, the scheduler, the controller manager, and OLM.

.Prerequisites

* Metrics forwarding is enabled and verified.

* The HyperShift Operator uses `METRICS_SET=All` or `METRICS_SET=SRE` with a matching `sre-metric-set` `ConfigMap` object in the hosted control plane namespace. The default `Telemetry` metrics set forwards only a small metric subset and leaves most dashboard panels empty.

* You have `cluster-admin` access to the hosted cluster.

.Procedure

. Download the sample dashboard JSON file by entering the following command:
+
[source,terminal]
----
$ curl -LO https://raw.githubusercontent.com/openshift/hypershift/main/contrib/metrics/guest-control-plane-dashboard.json
----
+
[NOTE]
====
If you deploy user-workload Grafana through the Grafana Operator, import the dashboard JSON as a `GrafanaDashboard` custom resource instead of using a console `ConfigMap` object.
====

. Create a `ConfigMap` object from the dashboard file in the `openshift-config-managed` namespace by entering the following command:
+
[source,terminal]
----
$ oc create configmap guest-control-plane-dashboard \
  --from-file=guest-control-plane-dashboard.json=guest-control-plane-dashboard.json \
  -n openshift-config-managed
----

. Label the `ConfigMap` object so the console discovers it as a dashboard by entering the following command:
+
[source,terminal]
----
$ oc label configmap guest-control-plane-dashboard \
  console.openshift.io/dashboard=true \
  -n openshift-config-managed
----

. Log in to the web console and click *Observe* -> *Dashboards*.

. Select the *Hosted Cluster Control Plane* dashboard.

. Optional: If you use `METRICS_SET=SRE` on the HyperShift Operator, configure the Operator and create or update the `sre-metric-set` `ConfigMap` object in the hosted control plane namespace with relabel configurations that forward the dashboard metric names.
+
.. Log in to the management cluster and set the metrics set on the HyperShift Operator by entering the following command:
+
[source,terminal]
----
$ oc set env -n hypershift deployment/operator METRICS_SET=SRE
----

.. Save the following manifest as `sre-metric-set.yaml`, replacing `<hcp_namespace>` with your hosted control plane namespace:
+
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: sre-metric-set
  namespace: <hcp_namespace>
data:
  config: |
    kubeAPIServer:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(apiserver_request_total|apiserver_request_duration_seconds_bucket|apiserver_current_inflight_requests|apiserver_storage_objects)"
    etcd:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_disk_backend_commit_duration_seconds_bucket|etcd_network_peer_round_trip_time_seconds_bucket|etcd_server_leader_changes_seen_total|etcd_server_has_leader)"
    kubeControllerManager:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(workqueue_depth|workqueue_adds_total)"
    kubeScheduler:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(scheduler_e2e_scheduling_duration_seconds_count|scheduler_schedule_attempts_total|scheduler_pending_pods)"
    cvo:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(cluster_version|cluster_operator_up|cluster_operator_conditions)"
    olm:
    - action: keep
      sourceLabels: ["__name__"]
      regex: "(csv_succeeded)"
----
+
This configuration forwards the 20 metric names that the dashboard uses, across six components.
+
For the full `SRE` metrics set configuration, see "Configuring the SRE metrics set".

.. Apply the `ConfigMap` object on the management cluster:
+
[source,terminal]
----
$ oc apply -f sre-metric-set.yaml
----
+
The Control Plane Operator detects the `ConfigMap` object change and updates the `metrics-proxy` configuration.
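
To sanity-check a `keep` rule offline, you can test candidate metric names against its regex before relying on the dashboard panels. The following sketch is illustrative only: the `keeps_metric` helper is hypothetical and not part of HyperShift, and it assumes that whole-line matching with `grep -x` is a close enough stand-in for the fully anchored matching that Prometheus relabel rules use, which holds for the simple alternations in this configuration.

[source,shell]
----
# Hypothetical helper, not part of HyperShift: succeeds when the metric name
# in $2 matches the keep regex in $1. The -x flag forces a whole-line match,
# mirroring the fully anchored matching of Prometheus relabel rules.
keeps_metric() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# The kubeControllerManager regex from the sre-metric-set example.
KCM_REGEX='(workqueue_depth|workqueue_adds_total)'

for name in workqueue_depth workqueue_depth_extra workqueue_retries_total; do
  if keeps_metric "$KCM_REGEX" "$name"; then
    echo "keep: $name"
  else
    echo "drop: $name"
  fi
done
----

In this run, only `workqueue_depth` is reported as kept. A name that merely starts with a listed metric name, such as the made-up `workqueue_depth_extra`, is dropped because the match must cover the whole name.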

.Verification

* The dashboard is displayed under *Observe* -> *Dashboards* in the web console.
* Panels display data when the configured metrics set includes the required metric names.
* The etcd database size panels show current use relative to the 8 GB limit.
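
If the dashboard does not appear in the web console, a quick CLI check can confirm that the labeled `ConfigMap` object exists. The following is a minimal sketch, assuming that your current `oc` session targets the hosted cluster. The function names `list_console_dashboards` and `dashboard_registered` are hypothetical helpers introduced for illustration.

[source,shell]
----
# Sketch: list every ConfigMap that carries the console dashboard label.
list_console_dashboards() {
  oc get configmap -n openshift-config-managed \
    -l console.openshift.io/dashboard=true -o name
}

# Succeeds when a ConfigMap named $1 appears in that list.
dashboard_registered() {
  list_console_dashboards | grep -Fqx "configmap/$1"
}
----

For example, `dashboard_registered guest-control-plane-dashboard && echo "dashboard found"` prints a message only when the labeled object is present.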
37 changes: 37 additions & 0 deletions modules/hcp-cp-metrics-enable.adoc
@@ -0,0 +1,37 @@
[#hcp-cp-metrics-enablement_{context}]
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-observability.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-cp-metrics-enable_{context}"]
= Enabling metrics forwarding

[role="_abstract"]
Enable metrics forwarding so that you can observe hosted control plane health from the hosted cluster monitoring stack.

If you are a hosted cluster administrator without management cluster access, ask a platform administrator to enable metrics forwarding on your `HostedCluster` resource.

.Prerequisites

* You are logged in to the management cluster. Alternatively, you can use a `kubeconfig` file with access to the namespace that contains the `HostedCluster` resource. The `HostedCluster` object exists on the management cluster; annotating it from a hosted cluster `kubeconfig` file fails or targets the wrong resource.

.Procedure

. Add the `hypershift.openshift.io/enable-metrics-forwarding=true` annotation to the `HostedCluster` resource on the management cluster by entering the following command:
+
[source,terminal]
----
$ oc annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/enable-metrics-forwarding=true
----
+
Replace `<hosted_cluster_namespace>` with the namespace of the hosted cluster and `<hosted_cluster_name>` with the name of the hosted cluster.

. To disable metrics forwarding, remove the annotation by entering the following command:
+
[source,terminal]
----
$ oc annotate hostedcluster -n <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/enable-metrics-forwarding-
----
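
To confirm the current state of the annotation after you add or remove it, you can read it back from the `HostedCluster` resource. The following is a minimal sketch, assuming that your `oc` session targets the management cluster. The `forwarding_annotation` and `forwarding_enabled` function names are hypothetical helpers introduced for illustration.

[source,shell]
----
# Sketch: print the value of the forwarding annotation, or nothing if it is unset.
# Arguments: $1 = hosted cluster namespace, $2 = hosted cluster name.
# The dots in the annotation key are escaped so that JSONPath treats the
# key as a single field name.
forwarding_annotation() {
  oc get hostedcluster -n "$1" "$2" \
    -o jsonpath='{.metadata.annotations.hypershift\.openshift\.io/enable-metrics-forwarding}'
}

# Succeeds only when the annotation is set to "true".
forwarding_enabled() {
  [ "$(forwarding_annotation "$1" "$2")" = "true" ]
}
----

For example, `forwarding_enabled <hosted_cluster_namespace> <hosted_cluster_name> && echo "forwarding enabled"`, with the placeholders quoted or replaced by real values, prints a message only while the annotation is present and set to `true`.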
41 changes: 41 additions & 0 deletions modules/hcp-cp-metrics-overview.adoc
@@ -0,0 +1,41 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-observability.adoc

:_mod-docs-content-type: CONCEPT
[id="hcp-cp-metrics-overview_{context}"]
= Control plane metrics for {hcp}

[role="_abstract"]
You can observe hosted control plane health from the hosted cluster monitoring stack when metrics forwarding is enabled.

With propagated metrics, you can diagnose API server, etcd, Operator, and scheduling issues from the hosted cluster web console and CLI without management cluster credentials.

This capability is available in {product-title} 4.22 and later.

Before {product-title} 4.22, control plane components for {hcp} ran on the management cluster and were invisible to the Cluster Monitoring Operator stack in the hosted cluster. Hosted cluster administrators could not query metrics such as `apiserver_request_total`, `etcd_mvcc_db_total_size_in_bytes`, or `csv_succeeded` from the hosted cluster Prometheus.

With metrics forwarding, selected control plane metrics are propagated from the management cluster into the hosted cluster platform Prometheus.

After you enable forwarding on the `HostedCluster` resource, you can use familiar PromQL queries, alerts, and dashboards.

[#hcp-cp-metrics-architecture_{context}]
== Metrics forwarding architecture

When you enable metrics forwarding, {hcp} deploys components on both the management cluster and the hosted cluster.

On the management cluster, the following components run in the hosted control plane namespace:

* The `endpoint-resolver` deployment discovers pod IP addresses for control plane components.
* The `metrics-proxy` deployment scrapes control plane pods, applies per-component metric filters, injects {product-title}-compatible labels, and serves aggregated metrics at paths such as `/metrics/kube-apiserver` and `/metrics/etcd` behind a TLS-passthrough Route.

On the hosted cluster, the following components run in the `openshift-monitoring` namespace:

* The `control-plane-metrics-forwarder` deployment runs HAProxy and TCP-proxies scrape requests to the management cluster `metrics-proxy` Route.
* A `PodMonitor` named `control-plane-metrics-forwarder` configures platform Prometheus to scrape the forwarder using mutual TLS (mTLS).

The data path is as follows:

. Platform Prometheus in the hosted cluster discovers the `PodMonitor` and scrapes the metrics-forwarder.
. The metrics-forwarder forwards the scrape over mTLS to the management cluster `metrics-proxy` Route.
. The metrics-proxy scrapes control plane pods through the endpoint-resolver and returns filtered, relabeled metrics.
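
The components described above can be checked from the CLI on each side of the data path. The following is a minimal sketch, assuming that you have a separate `kubeconfig` file for the management cluster and for the hosted cluster. The function names are hypothetical helpers introduced for illustration. The resource names are the ones listed in this section.

[source,shell]
----
# Management cluster side.
# Arguments: $1 = management cluster kubeconfig path,
#            $2 = hosted control plane namespace.
check_management_components() {
  oc --kubeconfig "$1" get deployment endpoint-resolver metrics-proxy -n "$2"
}

# Hosted cluster side.
# Argument: $1 = hosted cluster kubeconfig path.
check_hosted_components() {
  oc --kubeconfig "$1" get \
    deployment/control-plane-metrics-forwarder \
    podmonitor/control-plane-metrics-forwarder \
    -n openshift-monitoring
}
----

If either command reports a missing resource, confirm that metrics forwarding is enabled on the `HostedCluster` resource before you investigate further.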
81 changes: 81 additions & 0 deletions modules/hcp-cp-query-metrics-console.adoc
@@ -0,0 +1,81 @@
// Module included in the following assemblies:
//
// * hosted_control_planes/hcp-observability.adoc

:_mod-docs-content-type: PROCEDURE
[id="hcp-cp-query-metrics-console_{context}"]
= Querying control plane metrics in hosted clusters by using the web console

[role="_abstract"]
After you enable metrics forwarding, you can verify that control plane metrics are ingested and query them from the web console.

Use the same PromQL patterns as standalone {product-title} clusters because the metrics-proxy injects compatible labels.

.Prerequisites

* Metrics forwarding is enabled on the `HostedCluster` resource.
For enablement steps, see "Enabling metrics forwarding".

* You have `cluster-admin` access to the hosted cluster.

* At least two minutes have elapsed since you enabled forwarding so Prometheus can complete initial scrapes.

.Procedure

. Log in to the {product-title} web console for the hosted cluster.

. Click *Observe* -> *Metrics*.

. In the query field, enter a PromQL expression and run the query.
+
Use the following examples:
+
*Operator health*: list cluster service versions (CSVs) that are not in the `Succeeded` state:
+
[source,plaintext]
----
csv_succeeded{job="olm-operator-metrics"} == 0
----
+
*API server request rate*:
+
[source,plaintext]
----
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (verb, code)
----
+
*Scheduler activity* ({product-title} 4.22 and later with metrics forwarding enabled):
+
[source,plaintext]
----
sum(rate(scheduler_schedule_attempts_total[5m])) by (result)
----
+
*Workload-oriented API saturation*:
+
[source,plaintext]
----
apiserver_current_inflight_requests{job="apiserver"}
----
+
*Scheduling backlog*:
+
[source,plaintext]
----
scheduler_pending_pods
----
+
*Controller workqueue depth*:
+
[source,plaintext]
----
workqueue_depth{job="kube-controller-manager"}
----
+
For `csv_succeeded` and other OLM metrics, see "Exposed metrics".

.Verification

* Prometheus targets for `control-plane-metrics-forwarder` scrape pools report the `health: up` status.
* PromQL queries for `apiserver_request_total{job="apiserver"}` return at least one time series.
* Example queries in the web console return time series for enabled components.
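
The same PromQL expressions can also be spot-checked outside the web console. The following is a minimal sketch, assuming that your `oc` session targets the hosted cluster, that the `thanos-querier` route in the `openshift-monitoring` namespace is reachable from your workstation, and that your system trusts the route certificate. The `promql_query` function name is a hypothetical helper introduced for illustration.

[source,shell]
----
# Sketch: run one PromQL expression through the thanos-querier route and
# print the raw JSON response.
# Argument: $1 = PromQL expression.
promql_query() {
  host="$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')"
  token="$(oc whoami -t)"
  # -G sends the URL-encoded query as a GET parameter.
  curl -sS -G "https://${host}/api/v1/query" \
    -H "Authorization: Bearer ${token}" \
    --data-urlencode "query=$1"
}
----

For example, `promql_query 'scheduler_pending_pods'` returns a JSON document whose `data.result` array is non-empty when the scheduler metrics are being forwarded.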