CNTRLPLANE-3584: Add RBAC for kube-scheduler /metrics/resources endpoint#8680
CNTRLPLANE-3584: Add RBAC for kube-scheduler /metrics/resources endpoint#8680dhgautam99 wants to merge 3 commits into
Conversation
…erts The kube-scheduler previously auto-generated self-signed serving certificates via --cert-dir. This change adds a CA-signed serving certificate, a Service, and a ServiceMonitor to enable Prometheus metrics scraping with proper mTLS authentication.
Add unit tests for scheduler component options, ServiceMonitor adapter, and server certificate reconciliation. Regenerate test fixtures after adding Service, ServiceMonitor, and CA-signed serving certificate support.
The built-in system:monitoring ClusterRole only grants access to /metrics. This adds a ClusterRole and ClusterRoleBinding in the guest cluster to allow the metrics client to scrape /metrics/resources on kube-scheduler, which exposes kube_pod_resource_request and kube_pod_resource_limit metrics.
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@dhgautam99: This pull request references CNTRLPLANE-3584 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughThis PR enables TLS serving on kube-scheduler and metrics scraping via HTTPS. It adds scheduler server certificate PKI reconciliation, updates the Deployment to use explicit TLS certificate files, exposes a new service port for secure client communication, configures a ServiceMonitor for Prometheus metrics collection with TLS, and establishes the necessary metrics relabel configuration and RBAC permissions for the Prometheus instance to scrape scheduler metrics. 🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dhgautam99 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8680 +/- ##
==========================================
+ Coverage 40.69% 41.43% +0.74%
==========================================
Files 755 758 +3
Lines 93373 93718 +345
==========================================
+ Hits 37994 38831 +837
+ Misses 52646 52165 -481
+ Partials 2733 2722 -11
... and 46 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
I have all the information needed. Here is the analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe PR introduces new Go functions that lack corresponding unit tests:
The PR does include good test coverage for some components — Codecov uses the project-level coverage (40.69%) as the default patch target when no explicit Recommendations
Evidence
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go (1)
294-298: ⚡ Quick winUse manifest-derived role name to prevent drift.
Line 297 hardcodes
"hypershift-metrics-resources-reader"instead of referencing the manifest constructor. If either side changes, role binding will silently point to the wrong role.💡 Proposed fix
r.RoleRef = rbacv1.RoleRef{ APIGroup: rbacv1.SchemeGroupVersion.Group, Kind: "ClusterRole", - Name: "hypershift-metrics-resources-reader", + Name: hccomanifests.MetricsResourcesClusterRole().Name, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go` around lines 294 - 298, Replace the hardcoded RoleRef.Name string in the r.RoleRef assignment with the canonical name produced by the manifest constructor that builds the corresponding ClusterRole (the same code that currently constructs "hypershift-metrics-resources-reader"); locate that manifest constructor or exported name/constant used when creating the ClusterRole and use it to set r.RoleRef.Name in the r.RoleRef assignment so the RoleBinding always references the manifest-derived role name and cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.go`:
- Around line 11-16: The adaptServiceMonitor function currently indexes
sm.Spec.Endpoints[0] without verifying endpoints exist, which can panic; update
adaptServiceMonitor to first check len(sm.Spec.Endpoints) > 0 and return a clear
error (or create/append a default endpoint if desired) when it is zero, then
only set MetricRelabelConfigs and call util.ApplyClusterIDLabel on
sm.Spec.Endpoints[0]; reference the adaptServiceMonitor function,
sm.Spec.Endpoints, MetricRelabelConfigs, and util.ApplyClusterIDLabel when
making the change.
In `@support/metrics/sets.go`:
- Around line 229-237: The Telemetry branch in SchedulerRelabelConfigs
incorrectly returns sreMetricsSetConfig.KubeScheduler causing Telemetry to
depend on the SRE config; change the MetricsSetTelemetry case to return the
Telemetry-specific relabel config (e.g., telemetryMetricsSetConfig.KubeScheduler
or a dedicated telemetry config variable) instead of
sreMetricsSetConfig.KubeScheduler, ensuring MetricsSetSRE still returns
sreMetricsSetConfig.KubeScheduler and the default returns nil.
---
Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go`:
- Around line 294-298: Replace the hardcoded RoleRef.Name string in the
r.RoleRef assignment with the canonical name produced by the manifest
constructor that builds the corresponding ClusterRole (the same code that
currently constructs "hypershift-metrics-resources-reader"); locate that
manifest constructor or exported name/constant used when creating the
ClusterRole and use it to set r.RoleRef.Name in the r.RoleRef assignment so the
RoleBinding always references the manifest-derived role name and cannot drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7d14cef1-a721-4b03-b73a-d3d4e5886071
⛔ Files ignored due to path filters (20)
control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yamlis excluded by!**/testdata/**control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yamlis excluded by!**/testdata/**
📒 Files selected for processing (15)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/pki.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/scheduler.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/scheduler_test.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/deployment.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/service.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/servicemonitor.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/component.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/component_test.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor_test.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/rbac.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.gosupport/metrics/sets.go
| func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error { | ||
| sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{ | ||
| MatchNames: []string{sm.Namespace}, | ||
| } | ||
| sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet) | ||
| util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID) |
There was a problem hiding this comment.
Guard endpoint indexing to avoid reconcile panic.
Line 15 and Line 16 dereference sm.Spec.Endpoints[0] without validating length. A zero-endpoint manifest will panic the controller instead of returning a reconcile error.
💡 Proposed fix
import (
+ "fmt"
+
component "github.com/openshift/hypershift/support/controlplane-component"
"github.com/openshift/hypershift/support/metrics"
"github.com/openshift/hypershift/support/util"
@@
func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error {
sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{
MatchNames: []string{sm.Namespace},
}
+ if len(sm.Spec.Endpoints) == 0 {
+ return fmt.Errorf("kube-scheduler ServiceMonitor must define at least one endpoint")
+ }
sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet)
util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID)As per coding guidelines: "Avoid panics in Go except in truly unrecoverable cases."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error { | |
| sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{ | |
| MatchNames: []string{sm.Namespace}, | |
| } | |
| sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet) | |
| util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID) | |
| import ( | |
| "fmt" | |
| component "github.com/openshift/hypershift/support/controlplane-component" | |
| "github.com/openshift/hypershift/support/metrics" | |
| "github.com/openshift/hypershift/support/util" | |
| prometheusoperatorv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1" | |
| ) | |
| func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error { | |
| sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{ | |
| MatchNames: []string{sm.Namespace}, | |
| } | |
| if len(sm.Spec.Endpoints) == 0 { | |
| return fmt.Errorf("kube-scheduler ServiceMonitor must define at least one endpoint") | |
| } | |
| sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet) | |
| util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.go`
around lines 11 - 16, The adaptServiceMonitor function currently indexes
sm.Spec.Endpoints[0] without verifying endpoints exist, which can panic; update
adaptServiceMonitor to first check len(sm.Spec.Endpoints) > 0 and return a clear
error (or create/append a default endpoint if desired) when it is zero, then
only set MetricRelabelConfigs and call util.ApplyClusterIDLabel on
sm.Spec.Endpoints[0]; reference the adaptServiceMonitor function,
sm.Spec.Endpoints, MetricRelabelConfigs, and util.ApplyClusterIDLabel when
making the change.
| func SchedulerRelabelConfigs(set MetricsSet) []prometheusoperatorv1.RelabelConfig { | ||
| switch set { | ||
| case MetricsSetTelemetry: | ||
| return sreMetricsSetConfig.KubeScheduler | ||
| case MetricsSetSRE: | ||
| return sreMetricsSetConfig.KubeScheduler | ||
| default: | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Telemetry branch is incorrectly coupled to SRE config state.
Line 232 returns sreMetricsSetConfig.KubeScheduler for MetricsSetTelemetry, which makes Telemetry output depend on the SRE ConfigMap content. This can silently alter Telemetry scrape behavior whenever SRE config is present.
💡 Proposed fix
func SchedulerRelabelConfigs(set MetricsSet) []prometheusoperatorv1.RelabelConfig {
switch set {
case MetricsSetTelemetry:
- return sreMetricsSetConfig.KubeScheduler
+ return nil
case MetricsSetSRE:
return sreMetricsSetConfig.KubeScheduler
default:
return nil
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@support/metrics/sets.go` around lines 229 - 237, The Telemetry branch in
SchedulerRelabelConfigs incorrectly returns sreMetricsSetConfig.KubeScheduler
causing Telemetry to depend on the SRE config; change the MetricsSetTelemetry
case to return the Telemetry-specific relabel config (e.g.,
telemetryMetricsSetConfig.KubeScheduler or a dedicated telemetry config
variable) instead of sreMetricsSetConfig.KubeScheduler, ensuring MetricsSetSRE
still returns sreMetricsSetConfig.KubeScheduler and the default returns nil.
What this PR does / why we need it:
Adds guest-cluster RBAC to allow the
metrics-clientcertificate identity (system:serviceaccount:hypershift:prometheus) to access the kube-scheduler/metrics/resourcesendpoint.The built-in
system:monitoringClusterRole only grants GET on/metricsand/metrics/slis. The/metrics/resourcesendpoint (KEP-1748, exposingkube_pod_resource_requestandkube_pod_resource_limit) requires a separate authorization grant. Without this, prometheus-user-workload on the management cluster receives a 403 Forbidden when scraping/metrics/resourcesvia the ServiceMonitor.This PR creates a new
hypershift-metrics-resources-readerClusterRole and ClusterRoleBinding in the guest cluster via HCCO, granting GET on/metrics/resourcesto the metrics-client identity.Which issue(s) this PR fixes:
Fixes CNTRLPLANE-3584
Depends on #8489
Special notes for your reviewer:
system:monitoringClusterRoleBinding--authorization-kubeconfigpointing to the guest cluster KAS, so RBAC checks for metrics endpoints happen on the guest cluster even though the scheduler runs on the management cluster/metrics/resourcesendpoint only returns data on the leader scheduler podChecklist:
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests