Skip to content

CNTRLPLANE-3584: Add RBAC for kube-scheduler /metrics/resources endpoint#8680

Draft
dhgautam99 wants to merge 3 commits into
openshift:mainfrom
dhgautam99:enable-kube-scheduler-resource-metrics
Draft

CNTRLPLANE-3584: Add RBAC for kube-scheduler /metrics/resources endpoint#8680
dhgautam99 wants to merge 3 commits into
openshift:mainfrom
dhgautam99:enable-kube-scheduler-resource-metrics

Conversation

@dhgautam99
Copy link
Copy Markdown

@dhgautam99 dhgautam99 commented Jun 5, 2026

What this PR does / why we need it:

Adds guest-cluster RBAC to allow the metrics-client certificate identity (system:serviceaccount:hypershift:prometheus) to access the kube-scheduler /metrics/resources endpoint.

The built-in system:monitoring ClusterRole only grants GET on /metrics and /metrics/slis. The /metrics/resources endpoint (KEP-1748, exposing kube_pod_resource_request and kube_pod_resource_limit) requires a separate authorization grant. Without this, prometheus-user-workload on the management cluster receives a 403 Forbidden when scraping /metrics/resources via the ServiceMonitor.

This PR creates a new hypershift-metrics-resources-reader ClusterRole and ClusterRoleBinding in the guest cluster via HCCO, granting GET on /metrics/resources to the metrics-client identity.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3584

Depends on #8489

Special notes for your reviewer:

  • The RBAC is reconciled by HCCO (Hosted Cluster Config Operator) in the guest cluster, following the same pattern as the existing system:monitoring ClusterRoleBinding
  • kube-scheduler uses --authorization-kubeconfig pointing to the guest cluster KAS, so RBAC checks for metrics endpoints happen on the guest cluster even though the scheduler runs on the management cluster
  • The /metrics/resources endpoint only returns data on the leader scheduler pod

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Kube-scheduler now uses TLS certificates for secure client connections on port 10259.
    • Kube-scheduler metrics are now exposed via HTTPS monitoring with ServiceMonitor configuration.
    • Added RBAC resources granting prometheus metrics resource access.
  • Tests

    • Added tests for scheduler server certificate reconciliation, idempotency, and ServiceMonitor configuration.

…erts

The kube-scheduler previously auto-generated self-signed serving
certificates via --cert-dir. This change adds a CA-signed serving
certificate, a Service, and a ServiceMonitor to enable Prometheus
metrics scraping with proper mTLS authentication.
Add unit tests for scheduler component options, ServiceMonitor adapter,
and server certificate reconciliation. Regenerate test fixtures after
adding Service, ServiceMonitor, and CA-signed serving certificate support.
The built-in system:monitoring ClusterRole only grants access to /metrics.
This adds a ClusterRole and ClusterRoleBinding in the guest cluster to
allow the metrics client to scrape /metrics/resources on kube-scheduler,
which exposes kube_pod_resource_request and kube_pod_resource_limit metrics.
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Jun 5, 2026

@dhgautam99: This pull request references CNTRLPLANE-3584 which is a valid jira issue.

Details

In response to this:

What this PR does / why we need it:

Adds guest-cluster RBAC to allow the metrics-client certificate identity (system:serviceaccount:hypershift:prometheus) to access the kube-scheduler /metrics/resources endpoint.

The built-in system:monitoring ClusterRole only grants GET on /metrics and /metrics/slis. The /metrics/resources endpoint (KEP-1748, exposing kube_pod_resource_request and kube_pod_resource_limit) requires a separate authorization grant. Without this, prometheus-user-workload on the management cluster receives a 403 Forbidden when scraping /metrics/resources via the ServiceMonitor.

This PR creates a new hypershift-metrics-resources-reader ClusterRole and ClusterRoleBinding in the guest cluster via HCCO, granting GET on /metrics/resources to the metrics-client identity.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3584

Depends on #8489

Special notes for your reviewer:

  • The RBAC is reconciled by HCCO (Hosted Cluster Config Operator) in the guest cluster, following the same pattern as the existing system:monitoring ClusterRoleBinding
  • kube-scheduler uses --authorization-kubeconfig pointing to the guest cluster KAS, so RBAC checks for metrics endpoints happen on the guest cluster even though the scheduler runs on the management cluster
  • The /metrics/resources endpoint only returns data on the leader scheduler pod

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

📝 Walkthrough

Walkthrough

This PR enables TLS serving on kube-scheduler and metrics scraping via HTTPS. It adds scheduler server certificate PKI reconciliation, updates the Deployment to use explicit TLS certificate files, exposes a new service port for secure client communication, configures a ServiceMonitor for Prometheus metrics collection with TLS, and establishes the necessary metrics relabel configuration and RBAC permissions for the Prometheus instance to scrape scheduler metrics.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately identifies the main feature: adding RBAC for kube-scheduler /metrics/resources endpoint, which aligns with the primary objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All new test names use static, descriptive strings with no dynamic content, generated suffixes, timestamps, UUIDs, or other unstable values.
Test Structure And Quality ✅ Passed Tests meet quality standards: single-responsibility subtests, proper setup via helpers, in-memory resources (no cleanup), consistent assertions, and alignment with codebase Go testing conventions.
Topology-Aware Scheduling Compatibility ✅ Passed Deployment manifest has no replicas, affinity, nodeSelector, or topologySpreadConstraints; no topology-breaking scheduling constraints introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR adds only standard Go unit tests using testing.T and Gomega, not Ginkgo e2e tests (which use It(), Describe(), etc.). The custom check does not apply to this PR.
No-Weak-Crypto ✅ Passed No weak crypto algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time comparisons found in the 15 modified files.
Container-Privileges ✅ Passed No privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN, or allowPrivilegeEscalation settings in new manifests; root is justified for kube-scheduler control plane component.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data logging found in PR. Code adds PKI/ServiceMonitor/RBAC logic without exposing passwords, tokens, keys, PII, or customer data in logs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dhgautam99
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 29.57746% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.43%. Comparing base (fce95fc) to head (0b0a67a).
⚠️ Report is 76 commits behind head on main.

Files with missing lines Patch % Lines
...igoperator/controllers/resources/rbac/reconcile.go 0.00% 22 Missing ⚠️
...igoperator/controllers/resources/manifests/rbac.go 0.00% 12 Missing ⚠️
support/metrics/sets.go 0.00% 8 Missing ⚠️
.../hostedcontrolplane/v2/kube_scheduler/component.go 0.00% 4 Missing ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go 50.00% 2 Missing and 1 partial ⚠️
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8680      +/-   ##
==========================================
+ Coverage   40.69%   41.43%   +0.74%     
==========================================
  Files         755      758       +3     
  Lines       93373    93718     +345     
==========================================
+ Hits        37994    38831     +837     
+ Misses      52646    52165     -481     
+ Partials     2733     2722      -11     
Files with missing lines Coverage Δ
...or/controllers/hostedcontrolplane/pki/scheduler.go 100.00% <100.00%> (ø)
...edcontrolplane/v2/kube_scheduler/servicemonitor.go 100.00% <100.00%> (ø)
...rconfigoperator/controllers/resources/resources.go 56.73% <100.00%> (+0.12%) ⬆️
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% <0.00%> (ø)
...ostedcontrolplane/hostedcontrolplane_controller.go 45.71% <50.00%> (+0.68%) ⬆️
.../hostedcontrolplane/v2/kube_scheduler/component.go 26.08% <0.00%> (+26.08%) ⬆️
support/metrics/sets.go 2.29% <0.00%> (-0.06%) ⬇️
...igoperator/controllers/resources/manifests/rbac.go 0.00% <0.00%> (ø)
...igoperator/controllers/resources/rbac/reconcile.go 0.00% <0.00%> (ø)

... and 46 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.87% <0.00%> (+0.16%) ⬆️
cpo-hostedcontrolplane 43.59% <70.37%> (+1.78%) ⬆️
cpo-other 42.63% <5.55%> (+1.24%) ⬆️
hypershift-operator 51.57% <ø> (+0.72%) ⬆️
other 31.64% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented Jun 5, 2026

I have all the information needed. Here is the analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: codecov/patch (GitHub check, not a Prow CI job)
  • Build ID: Check Run ID 79688744447
  • PR: #8680CNTRLPLANE-3584: Add RBAC for kube-scheduler /metrics/resources endpoint
  • Branch: enable-kube-scheduler-resource-metricsmain
  • Status: codecov/patch ❌ FAIL | codecov/project ✅ PASS

Test Failure Analysis

Error

Patch coverage is 29.57% with 50 lines in your changes missing coverage.
Target: 40.69% (project-level baseline). Patch coverage fell 11.12 percentage points below the target.

Summary

The codecov/patch check failed because 50 out of ~71 new executable lines added by this PR have zero unit-test coverage, yielding a patch coverage of only 29.57% — well below the project baseline of 40.69%. The uncovered code is concentrated in three areas: the new RBAC manifest/reconcile functions for hypershift-metrics-resources-reader (34 lines, 0%), the new SchedulerRelabelConfigs function in support/metrics/sets.go (8 lines, 0%), and minor gaps in component.go, pki.go, and hostedcontrolplane_controller.go. Notably, codecov/project passed — overall project coverage actually increased from 40.69% to 41.43% — so this is strictly a patch-level (new-code-only) coverage gap, not a regression of the broader codebase.

Root Cause

The PR introduces new Go functions that lack corresponding unit tests:

  1. manifests/rbac.go — 12 lines, 0% covered: Two new manifest constructor functions (MetricsResourcesClusterRole() and MetricsResourcesClusterRoleBinding()) that create the hypershift-metrics-resources-reader ClusterRole and ClusterRoleBinding. These are simple factory functions but are not exercised by any test.

  2. rbac/reconcile.go — 22 lines, 0% covered: Two new reconcile functions (ReconcileMetricsResourcesClusterRole and ReconcileMetricsResourcesClusterRoleBinding) that set the RBAC policy rules and subject bindings. These contain the core RBAC logic (NonResourceURL /metrics/resources with get verb, bound to system:serviceaccount:hypershift:prometheus) and are the most important uncovered code.

  3. support/metrics/sets.go — 8 lines, 0% covered: The new SchedulerRelabelConfigs() function (a switch on MetricsSet returning relabel configs for telemetry/SRE sets) has no tests. This follows the exact same pattern as the existing KASRelabelConfigs(), KCMRelabelConfigs(), etc., which also have 0% coverage — so the file has a pre-existing coverage debt pattern.

  4. v2/kube_scheduler/component.go — 4 lines, 0% covered: New code adding the scheduler-server secret as a volume source to the kube-scheduler component.

  5. manifests/pki.go — 1 line, 0% covered: The new SchedulerServerCertSecret() factory function.

  6. hostedcontrolplane_controller.go — 3 lines, 50% covered: The new cert reconciliation block for schedulerServerSecret — partially covered because the controller reconcile path is partially exercised by existing integration-style tests, but the new block's error path is not.

The PR does include good test coverage for some components — pki/scheduler.go (100%), v2/kube_scheduler/servicemonitor.go (100%), and resources/resources.go (100%) — but the RBAC manifest/reconcile layer and the metrics-set helper were shipped without tests.

Codecov uses the project-level coverage (40.69%) as the default patch target when no explicit target is set in codecov.yml. Since the patch coverage (29.57%) falls below this threshold, the check fails.

Recommendations
  1. Add unit tests for the RBAC reconcile functions (highest impact — 22 lines):

    • Test ReconcileMetricsResourcesClusterRole — verify it sets the correct NonResourceURLs (/metrics/resources) and Verbs (get).
    • Test ReconcileMetricsResourcesClusterRoleBinding — verify the RoleRef points to hypershift-metrics-resources-reader and the subject is system:serviceaccount:hypershift:prometheus.
    • These are simple struct-setting functions; the tests would be straightforward table-driven tests similar to existing RBAC reconcile tests in the same file.
  2. Add unit tests for the RBAC manifest constructors (12 lines):

    • Test MetricsResourcesClusterRole() and MetricsResourcesClusterRoleBinding() — verify they return objects with the correct name (hypershift-metrics-resources-reader).
  3. Add unit tests for SchedulerRelabelConfigs (8 lines):

    • Test the three MetricsSet branches (Telemetry, SRE, default). Note: the existing sibling functions (KASRelabelConfigs, KCMRelabelConfigs) also lack tests, so this is consistent with the pattern — but adding a test would be simple and would clear the coverage check.
  4. Alternatively, if the team considers these functions too trivial to test, the codecov/patch failure can be tolerated as a non-blocking signal — codecov/project already passes, confirming no overall coverage regression. Check with the team whether codecov/patch is a required merge check or informational only.

Evidence
Evidence Detail
Check result codecov/patch: failure; codecov/project: pass
Patch coverage 29.57% (21 of ~71 executable lines covered)
Target 40.69% (project baseline, no explicit override in codecov.yml)
Lines missing coverage 50 lines across 6 files
Largest gap rbac/reconcile.go — 22 lines at 0% (new ReconcileMetricsResourcesClusterRole and ReconcileMetricsResourcesClusterRoleBinding)
Second largest gap manifests/rbac.go — 12 lines at 0% (MetricsResourcesClusterRole(), MetricsResourcesClusterRoleBinding())
Third largest gap support/metrics/sets.go — 8 lines at 0% (SchedulerRelabelConfigs())
Well-covered files pki/scheduler.go (100%), servicemonitor.go (100%), resources.go (100%)
Project coverage delta +0.74% (40.69% → 41.43%) — overall coverage improved
Codecov config codecov.yml has no explicit patch target — defaults to project-level coverage
Codecov report Full report

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go (1)

294-298: ⚡ Quick win

Use manifest-derived role name to prevent drift.

Line 297 hardcodes "hypershift-metrics-resources-reader" instead of referencing the manifest constructor. If either side changes, role binding will silently point to the wrong role.

💡 Proposed fix
 	r.RoleRef = rbacv1.RoleRef{
 		APIGroup: rbacv1.SchemeGroupVersion.Group,
 		Kind:     "ClusterRole",
-		Name:     "hypershift-metrics-resources-reader",
+		Name:     hccomanifests.MetricsResourcesClusterRole().Name,
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go`
around lines 294 - 298, Replace the hardcoded RoleRef.Name string in the
r.RoleRef assignment with the canonical name produced by the manifest
constructor that builds the corresponding ClusterRole (the same code that
currently constructs "hypershift-metrics-resources-reader"); locate that
manifest constructor or exported name/constant used when creating the
ClusterRole and use it to set r.RoleRef.Name in the r.RoleRef assignment so the
RoleBinding always references the manifest-derived role name and cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.go`:
- Around line 11-16: The adaptServiceMonitor function currently indexes
sm.Spec.Endpoints[0] without verifying endpoints exist, which can panic; update
adaptServiceMonitor to first check len(sm.Spec.Endpoints) > 0 and return a clear
error (or create/append a default endpoint if desired) when it is zero, then
only set MetricRelabelConfigs and call util.ApplyClusterIDLabel on
sm.Spec.Endpoints[0]; reference the adaptServiceMonitor function,
sm.Spec.Endpoints, MetricRelabelConfigs, and util.ApplyClusterIDLabel when
making the change.

In `@support/metrics/sets.go`:
- Around line 229-237: The Telemetry branch in SchedulerRelabelConfigs
incorrectly returns sreMetricsSetConfig.KubeScheduler causing Telemetry to
depend on the SRE config; change the MetricsSetTelemetry case to return the
Telemetry-specific relabel config (e.g., telemetryMetricsSetConfig.KubeScheduler
or a dedicated telemetry config variable) instead of
sreMetricsSetConfig.KubeScheduler, ensuring MetricsSetSRE still returns
sreMetricsSetConfig.KubeScheduler and the default returns nil.

---

Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go`:
- Around line 294-298: Replace the hardcoded RoleRef.Name string in the
r.RoleRef assignment with the canonical name produced by the manifest
constructor that builds the corresponding ClusterRole (the same code that
currently constructs "hypershift-metrics-resources-reader"); locate that
manifest constructor or exported name/constant used when creating the
ClusterRole and use it to set r.RoleRef.Name in the r.RoleRef assignment so the
RoleBinding always references the manifest-derived role name and cannot drift.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7d14cef1-a721-4b03-b73a-d3d4e5886071

📥 Commits

Reviewing files that changed from the base of the PR and between f13c62d and 0b0a67a.

⛔ Files ignored due to path filters (20)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/AROSwift/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/GCP/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/IBMCloud/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_service.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-scheduler/zz_fixture_TestControlPlaneComponents_kube_scheduler_servicemonitor.yaml is excluded by !**/testdata/**
📒 Files selected for processing (15)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/scheduler.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/scheduler_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/deployment.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/service.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-scheduler/servicemonitor.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/component.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/component_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/rbac.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/rbac/reconcile.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • support/metrics/sets.go

Comment on lines +11 to +16
func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error {
sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{
MatchNames: []string{sm.Namespace},
}
sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet)
util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard endpoint indexing to avoid reconcile panic.

Line 15 and Line 16 dereference sm.Spec.Endpoints[0] without validating length. A zero-endpoint manifest will panic the controller instead of returning a reconcile error.

💡 Proposed fix
 import (
+	"fmt"
+
 	component "github.com/openshift/hypershift/support/controlplane-component"
 	"github.com/openshift/hypershift/support/metrics"
 	"github.com/openshift/hypershift/support/util"
@@
 func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error {
 	sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{
 		MatchNames: []string{sm.Namespace},
 	}
+	if len(sm.Spec.Endpoints) == 0 {
+		return fmt.Errorf("kube-scheduler ServiceMonitor must define at least one endpoint")
+	}
 	sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet)
 	util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID)

As per coding guidelines: "Avoid panics in Go except in truly unrecoverable cases."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error {
sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{
MatchNames: []string{sm.Namespace},
}
sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet)
util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID)
import (
"fmt"
component "github.com/openshift/hypershift/support/controlplane-component"
"github.com/openshift/hypershift/support/metrics"
"github.com/openshift/hypershift/support/util"
prometheusoperatorv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)
func adaptServiceMonitor(cpContext component.WorkloadContext, sm *prometheusoperatorv1.ServiceMonitor) error {
sm.Spec.NamespaceSelector = prometheusoperatorv1.NamespaceSelector{
MatchNames: []string{sm.Namespace},
}
if len(sm.Spec.Endpoints) == 0 {
return fmt.Errorf("kube-scheduler ServiceMonitor must define at least one endpoint")
}
sm.Spec.Endpoints[0].MetricRelabelConfigs = metrics.SchedulerRelabelConfigs(cpContext.MetricsSet)
util.ApplyClusterIDLabel(&sm.Spec.Endpoints[0], cpContext.HCP.Spec.ClusterID)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/servicemonitor.go`
around lines 11 - 16, The adaptServiceMonitor function currently indexes
sm.Spec.Endpoints[0] without verifying endpoints exist, which can panic; update
adaptServiceMonitor to first check len(sm.Spec.Endpoints) > 0 and return a clear
error (or create/append a default endpoint if desired) when it is zero, then
only set MetricRelabelConfigs and call util.ApplyClusterIDLabel on
sm.Spec.Endpoints[0]; reference the adaptServiceMonitor function,
sm.Spec.Endpoints, MetricRelabelConfigs, and util.ApplyClusterIDLabel when
making the change.

Comment thread support/metrics/sets.go
Comment on lines +229 to +237
func SchedulerRelabelConfigs(set MetricsSet) []prometheusoperatorv1.RelabelConfig {
switch set {
case MetricsSetTelemetry:
return sreMetricsSetConfig.KubeScheduler
case MetricsSetSRE:
return sreMetricsSetConfig.KubeScheduler
default:
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Telemetry branch is incorrectly coupled to SRE config state.

Line 232 returns sreMetricsSetConfig.KubeScheduler for MetricsSetTelemetry, which makes Telemetry output depend on the SRE ConfigMap content. This can silently alter Telemetry scrape behavior whenever SRE config is present.

💡 Proposed fix
 func SchedulerRelabelConfigs(set MetricsSet) []prometheusoperatorv1.RelabelConfig {
 	switch set {
 	case MetricsSetTelemetry:
-		return sreMetricsSetConfig.KubeScheduler
+		return nil
 	case MetricsSetSRE:
 		return sreMetricsSetConfig.KubeScheduler
 	default:
 		return nil
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/metrics/sets.go` around lines 229 - 237, The Telemetry branch in
SchedulerRelabelConfigs incorrectly returns sreMetricsSetConfig.KubeScheduler
causing Telemetry to depend on the SRE config; change the MetricsSetTelemetry
case to return the Telemetry-specific relabel config (e.g.,
telemetryMetricsSetConfig.KubeScheduler or a dedicated telemetry config
variable) instead of sreMetricsSetConfig.KubeScheduler, ensuring MetricsSetSRE
still returns sreMetricsSetConfig.KubeScheduler and the default returns nil.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants