diff --git a/hosted_control_planes/hcp-observability.adoc b/hosted_control_planes/hcp-observability.adoc index cdd25a25c5cc..f17016994dc7 100644 --- a/hosted_control_planes/hcp-observability.adoc +++ b/hosted_control_planes/hcp-observability.adoc @@ -9,10 +9,33 @@ toc::[] [role="_abstract"] You can gather metrics for {hcp} by configuring metrics sets. Monitoring dashboards are created in the management cluster for each hosted cluster that it manages. +// Metrics sets include::modules/hosted-control-planes-metrics-sets.adoc[leveloffset=+1] include::modules/hosted-control-planes-monitoring-dashboard.adoc[leveloffset=+1] +// Monitoring CP metrics from the hosted cluster +include::modules/hcp-cp-metrics-overview.adoc[leveloffset=+1] + +include::modules/hcp-cp-metrics-enable.adoc[leveloffset=+2] + +include::modules/hcp-cp-query-metrics.adoc[leveloffset=+2] + +[role="_additional-resources"] +.Additional resources + +* xref:../operators/understanding/olm/olm-understanding-metrics.adoc#olm-metrics_olm-understanding-metrics[Exposed metrics] + +include::modules/hcp-cp-query-metrics-console.adoc[leveloffset=+2] + +[role="_additional-resources"] +.Additional resources + +* xref:../operators/understanding/olm/olm-understanding-metrics.adoc#olm-metrics_olm-understanding-metrics[Exposed metrics] + +include::modules/hcp-cp-metrics-dashboards.adoc[leveloffset=+2] + +//Connectivity metrics include::modules/hcp-connectivity-metrics.adoc[leveloffset=+1] include::modules/hcp-connect-data-plane.adoc[leveloffset=+2] diff --git a/modules/hcp-cp-metrics-dashboards.adoc b/modules/hcp-cp-metrics-dashboards.adoc new file mode 100644 index 000000000000..3d0190f76c73 --- /dev/null +++ b/modules/hcp-cp-metrics-dashboards.adoc @@ -0,0 +1,119 @@ +// Module included in the following assemblies: +// +// * hosted_control_planes/hcp-observability.adoc + +:_mod-docs-content-type: PROCEDURE +[id="hcp-cp-metrics-dashboards_{context}"] += Importing control plane health dashboards + +[role="_abstract"] 
+You can import a sample Grafana dashboard that visualizes propagated control plane metrics in the hosted cluster web console. The dashboard covers API server, etcd, cluster Operators, scheduler, controller manager, and OLM health panels. + +.Prerequisites + +* Metrics forwarding is enabled and verified. + +* The HyperShift Operator uses `METRICS_SET=All` or `METRICS_SET=SRE` with a matching `sre-metric-set` `ConfigMap` object in the hosted control plane namespace. The default `Telemetry` metrics set forwards only a small metric subset and leaves most dashboard panels empty. + +* You have `cluster-admin` access to the hosted cluster. + +.Procedure + +. Download the sample dashboard JSON file by entering the following command: ++ +[source,terminal] +---- +$ curl -LO https://raw.githubusercontent.com/openshift/hypershift/main/contrib/metrics/guest-control-plane-dashboard.json +---- ++ +[NOTE] +==== +If you deploy user-workload Grafana through the Grafana Operator, import the dashboard JSON as a `GrafanaDashboard` custom resource instead of using a console `ConfigMap` object. +==== + +. Create a `ConfigMap` object from the dashboard file in the `openshift-config-managed` namespace by entering the following command: ++ +[source,terminal] +---- +$ oc create configmap guest-control-plane-dashboard \ + --from-file=guest-control-plane-dashboard.json=guest-control-plane-dashboard.json \ + -n openshift-config-managed +---- + +. Label the `ConfigMap` object so the console discovers it as a dashboard by entering the following command: ++ +[source,terminal] +---- +$ oc label configmap guest-control-plane-dashboard \ + console.openshift.io/dashboard=true \ + -n openshift-config-managed +---- + +. Log in to the web console and click *Observe* -> *Dashboards*. + +. Select the *Hosted Cluster Control Plane* dashboard. + +. 
Optional: If you use `METRICS_SET=SRE` on the HyperShift Operator, configure the Operator and create or update the `sre-metric-set` `ConfigMap` object in the hosted control plane namespace with relabel configurations that forward the dashboard metric names. ++ +.. Log in to the management cluster and set the metrics set on the HyperShift Operator by entering the following command: ++ +[source,terminal] +---- +$ oc set env -n hypershift deployment/operator METRICS_SET=SRE +---- + +.. Replace `` with your hosted control plane namespace and create the `ConfigMap` object: ++ +[source,yaml] +---- +apiVersion: v1 +kind: ConfigMap +metadata: + name: sre-metric-set + namespace: +data: + config: | + kubeAPIServer: + - action: keep + sourceLabels: ["__name__"] + regex: "(apiserver_request_total|apiserver_request_duration_seconds_bucket|apiserver_current_inflight_requests|apiserver_storage_objects)" + etcd: + - action: keep + sourceLabels: ["__name__"] + regex: "(etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_disk_backend_commit_duration_seconds_bucket|etcd_network_peer_round_trip_time_seconds_bucket|etcd_server_leader_changes_seen_total|etcd_server_has_leader)" + kubeControllerManager: + - action: keep + sourceLabels: ["__name__"] + regex: "(workqueue_depth|workqueue_adds_total)" + kubeScheduler: + - action: keep + sourceLabels: ["__name__"] + regex: "(scheduler_e2e_scheduling_duration_seconds_count|scheduler_schedule_attempts_total|scheduler_pending_pods)" + cvo: + - action: keep + sourceLabels: ["__name__"] + regex: "(cluster_version|cluster_operator_up|cluster_operator_conditions)" + olm: + - action: keep + sourceLabels: ["__name__"] + regex: "(csv_succeeded)" +---- ++ +This configuration forwards the 20 metric names that the dashboard uses, across six components. ++ +For the full `SRE` metrics set configuration, see "Configuring the SRE metrics set". + +.. 
Apply the `ConfigMap` object on the management cluster: ++ +[source,terminal] +---- +$ oc apply -f sre-metric-set.yaml +---- ++ +The Control Plane Operator detects the `ConfigMap` object change and updates the `metrics-proxy` configuration. + +.Verification + +* The dashboard is displayed under *Observe* -> *Dashboards* in the web console. +* Panels display data when the configured metrics set includes the required metric names. +* The etcd database size panels show current use relative to the 8 GB limit. diff --git a/modules/hcp-cp-metrics-enable.adoc b/modules/hcp-cp-metrics-enable.adoc new file mode 100644 index 000000000000..5477731bfe77 --- /dev/null +++ b/modules/hcp-cp-metrics-enable.adoc @@ -0,0 +1,37 @@ +[#hcp-cp-metrics-enablement_{context}] +// Module included in the following assemblies: +// +// * hosted_control_planes/hcp-observability.adoc + +:_mod-docs-content-type: PROCEDURE +[id="hcp-cp-metrics-enable_{context}"] += Enabling metrics forwarding + +[role="_abstract"] +Enable metrics forwarding so that you can observe hosted control plane health from the hosted cluster monitoring stack. + +If you are a hosted cluster administrator without management cluster access, ask a platform administrator to enable metrics forwarding on your `HostedCluster` resource. + +.Prerequisites + +* You are logged in to the management cluster. Alternatively, you can use a `kubeconfig` file with access to the namespace that contains the `HostedCluster` resource. The `HostedCluster` object exists on the management cluster; annotating it from a hosted cluster `kubeconfig` file fails or targets the wrong resource. + +.Procedure + +. 
Add the `hypershift.openshift.io/enable-metrics-forwarding=true` annotation to the `HostedCluster` resource on the management cluster by entering the following command: ++ +[source,terminal] +---- +$ oc annotate hostedcluster -n \ + hypershift.openshift.io/enable-metrics-forwarding=true +---- ++ +Replace `` with the namespace of the hosted cluster and `` with the name of the hosted cluster. + +. To disable metrics forwarding, remove the annotation by entering the following command: ++ +[source,terminal] +---- +$ oc annotate hostedcluster -n \ + hypershift.openshift.io/enable-metrics-forwarding- +---- diff --git a/modules/hcp-cp-metrics-overview.adoc b/modules/hcp-cp-metrics-overview.adoc new file mode 100644 index 000000000000..8331c6179afa --- /dev/null +++ b/modules/hcp-cp-metrics-overview.adoc @@ -0,0 +1,41 @@ +// Module included in the following assemblies: +// +// * hosted_control_planes/hcp-observability.adoc + +:_mod-docs-content-type: CONCEPT +[id="hcp-cp-metrics-overview_{context}"] += Control plane metrics for {hcp} + +[role="_abstract"] +You can observe hosted control plane health from the hosted cluster monitoring stack when metrics forwarding is enabled. + +With propagated metrics, you can diagnose API server, etcd, Operator, and scheduling issues from the hosted cluster web console and CLI without management cluster credentials. + +This capability is available in {product-title} 4.22 and later. + +Before {product-title} 4.22, control plane components for {hcp} ran on the management cluster and were invisible to the Cluster Monitoring Operator stack in the hosted cluster. Hosted cluster administrators could not query metrics such as `apiserver_request_total`, `etcd_mvcc_db_total_size_in_bytes`, or `csv_succeeded` from the hosted cluster Prometheus. + +With metrics forwarding, selected control plane metrics are propagated from the management cluster into the hosted cluster platform Prometheus. 
+ +After you enable forwarding on the `HostedCluster` resource, you can use familiar PromQL queries, alerts, and dashboards. + +[#hcp-cp-metrics-architecture_{context}] +== Metrics forwarding architecture + +When you enable metrics forwarding, {hcp} deploys components on both the management cluster and the hosted cluster. + +On the management cluster, in the hosted control plane namespace, the following steps take place: + +* The `endpoint-resolver` deployment discovers pod IP addresses for control plane components. +* The `metrics-proxy` deployment scrapes control plane pods, applies per-component metric filters, injects {product-title}-compatible labels, and serves aggregated metrics at paths, such as `/metrics/kube-apiserver` and `/metrics/etcd`, behind a TLS-passthrough Route. + +On the hosted cluster, in the `openshift-monitoring` namespace, the following steps take place: + +* The `control-plane-metrics-forwarder` deployment runs HAProxy and TCP-proxies scrape requests to the management cluster `metrics-proxy` Route. +* A `PodMonitor` named `control-plane-metrics-forwarder` configures platform Prometheus to scrape the forwarder using mutual TLS (mTLS). + +The data path is as follows: + +. Platform Prometheus in the hosted cluster discovers the `PodMonitor` and scrapes the metrics-forwarder. +. The metrics-forwarder forwards the scrape over mTLS to the management cluster `metrics-proxy` Route. +. The metrics-proxy scrapes control plane pods through the endpoint-resolver and returns filtered, relabeled metrics. 
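The per-component filtering that `metrics-proxy` applies can be pictured with a small sketch. The following Python example is illustrative only and is not the HyperShift implementation. It assumes Prometheus-style `keep` semantics, where the regular expression must match the entire metric name. The `keep_metrics` function name and the `apiserver_request_total_extra` metric name are hypothetical.

```python
import re


def keep_metrics(names, pattern):
    """Return the metric names that a Prometheus-style `keep` rule retains.

    Assumption: the pattern is fully anchored, as in Prometheus relabeling,
    so a name is kept only when the whole name matches.
    """
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Metric names as they might arrive from a control plane pod.
scraped = [
    "apiserver_request_total",
    "apiserver_request_total_extra",  # hypothetical name; dropped because of anchoring
    "etcd_mvcc_db_total_size_in_bytes",  # dropped; not listed in this filter
]

kept = keep_metrics(
    scraped,
    "(apiserver_request_total|apiserver_current_inflight_requests)",
)
print(kept)
```

Because the match is anchored, a metric is forwarded only when its exact name is listed in the filter, so a name that merely begins with a listed name is not forwarded.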
diff --git a/modules/hcp-cp-query-metrics-console.adoc b/modules/hcp-cp-query-metrics-console.adoc new file mode 100644 index 000000000000..e8131ae0d78f --- /dev/null +++ b/modules/hcp-cp-query-metrics-console.adoc @@ -0,0 +1,81 @@ +// Module included in the following assemblies: +// +// * hosted_control_planes/hcp-observability.adoc + +:_mod-docs-content-type: PROCEDURE +[id="hcp-cp-query-metrics-console_{context}"] += Querying control plane metrics in hosted clusters by using the web console + +[role="_abstract"] +After you enable metrics forwarding, you can verify that control plane metrics are ingested and query them from the web console. + +Use the same PromQL patterns as standalone {product-title} clusters because the metrics-proxy injects compatible labels. + +.Prerequisites + +* Metrics forwarding is enabled on the `HostedCluster` resource. +For enablement steps, see "Enabling metrics forwarding". + +* You have `cluster-admin` access to the hosted cluster. + +* At least two minutes have elapsed since you enabled forwarding so Prometheus can complete initial scrapes. + +.Procedure + +. Log in to the {product-title} web console for the hosted cluster. + +. Click *Observe* -> *Metrics*. + +. In the query field, enter a PromQL expression and run the query. 
++ +Use the following examples: ++ +*Operator health*: list CSVs that are not in the `Succeeded` state: ++ +[source,plaintext] +---- +csv_succeeded{job="olm-operator-metrics"} == 0 +---- ++ +*API server request rate*: ++ +[source,plaintext] +---- +sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (verb, code) +---- ++ +*Scheduler activity* ({product-title} 4.22 and later with metrics forwarding enabled): ++ +[source,plaintext] +---- +sum(rate(scheduler_schedule_attempts_total[5m])) by (result) +---- ++ +*Workload-oriented API saturation*: ++ +[source,plaintext] +---- +apiserver_current_inflight_requests{job="apiserver"} +---- ++ +*Scheduling backlog*: ++ +[source,plaintext] +---- +scheduler_pending_pods +---- ++ +*Controller workqueue depth*: ++ +[source,plaintext] +---- +workqueue_depth{job="kube-controller-manager"} +---- ++ +For `csv_succeeded` and other OLM metrics, see "Exposed metrics". + +.Verification + +* Prometheus targets for `control-plane-metrics-forwarder` scrape pools report the `health: up` status. +* PromQL queries for `apiserver_request_total{job="apiserver"}` return nonzero results. +* Example queries in the web console return time series for enabled components. diff --git a/modules/hcp-cp-query-metrics.adoc b/modules/hcp-cp-query-metrics.adoc new file mode 100644 index 000000000000..f49500bdff3b --- /dev/null +++ b/modules/hcp-cp-query-metrics.adoc @@ -0,0 +1,87 @@ +// Module included in the following assemblies: +// +// * hosted_control_planes/hcp-observability.adoc + +:_mod-docs-content-type: PROCEDURE +[id="hcp-cp-query-metrics_{context}"] += Querying control plane metrics in hosted clusters by using the CLI + +[role="_abstract"] +After you enable metrics forwarding, you can verify that control plane metrics are ingested and query them from the CLI. + +Use the same PromQL patterns as standalone {product-title} clusters because the metrics-proxy injects compatible labels. 
+ +.Prerequisites + +* Metrics forwarding is enabled on the `HostedCluster` resource. +For enablement steps, see "Enabling metrics forwarding". + +* You have `cluster-admin` access to the hosted cluster. + +* At least two minutes have elapsed since you enabled forwarding so Prometheus can complete initial scrapes. + +.Procedure + +. Verify that the `control-plane-metrics-forwarder` deployment exists in the `openshift-monitoring` namespace: ++ +[source,terminal] +---- +$ oc get deployment control-plane-metrics-forwarder -n openshift-monitoring +---- ++ +[NOTE] +==== +Control plane metrics are available when the Cluster Monitoring Operator and platform Prometheus are running, even if no compute nodes are scheduled. +Data-plane node and workload metrics still require compute nodes. +==== + +. Verify that the `control-plane-metrics-forwarder` `PodMonitor` exists: ++ +[source,terminal] +---- +$ oc get podmonitor control-plane-metrics-forwarder -n openshift-monitoring +---- + +. Optional: Verify that management-cluster components are running by logging in to the management cluster and replacing `` with the namespace for your hosted cluster. Typically, the format of the namespace is `-`. ++ +.. Enter the following command: ++ +[source,terminal] +---- +$ oc get deployment endpoint-resolver metrics-proxy -n +---- ++ +.. Enter the following command: ++ +[source,terminal] +---- +$ oc get route metrics-proxy -n +---- + +. Verify that Prometheus reports healthy scrape targets for the forwarder: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \ + curl -s http://localhost:9090/api/v1/targets \ + | jq '.data.activeTargets[] | select(.scrapePool | contains("control-plane-metrics-forwarder")) | {scrapePool, scrapeUrl: .scrapeUrl, health}' +---- ++ +You should see one target per forwarded component with the status of `"health": "up"`. + +. 
Confirm that Kubernetes API server metrics are ingested by querying `apiserver_request_total`: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \ + curl -gs 'http://localhost:9090/api/v1/query?query=apiserver_request_total{job="apiserver"}' \ + | jq '.data.result | length' +---- ++ +A nonzero result confirms that API server metrics are available in the hosted cluster monitoring stack. + +.Verification + +* Prometheus targets for `control-plane-metrics-forwarder` scrape pools report the `health: up` status. +* PromQL queries for `apiserver_request_total{job="apiserver"}` return nonzero results. +* Example queries in the web console return time series for enabled components.
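If you script this check instead of running it by hand, it can help to percent-encode the PromQL expression rather than rely on the `curl -g` option. The following Python sketch is one possible approach, not part of the documented procedure. It assumes the standard Prometheus HTTP API response shape. The `build_query_url` and `count_series` helper names are hypothetical, and the sample response is invented data that only illustrates the format.

```python
import json
from urllib.parse import urlencode

# Address used by the curl commands in the procedure.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    params = urlencode({"query": promql})
    return base_url + "/api/v1/query?" + params


def count_series(response_text):
    """Return the number of series in a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    return len(payload["data"]["result"])


url = build_query_url('apiserver_request_total{job="apiserver"}')
print(url)

# Invented sample response for illustration only; real output has more labels.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "apiserver_request_total", "job": "apiserver"},
                "value": [1700000000, "42"],
            }
        ],
    },
})
print(count_series(sample_response))
```

A nonzero count corresponds to the nonzero `jq '.data.result | length'` output in the procedure.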