CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749

Open
enxebre wants to merge 3 commits into openshift:main from enxebre:fix-CNTRLPLANE-2775

Conversation

@enxebre
Member

@enxebre enxebre commented Feb 19, 2026

Description

Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

| Metric | Type | Description |
|---|---|---|
| hypershift_kube_apiserver_available | Gauge | 1 if /healthz returns HTTP 200, 0 otherwise |
| hypershift_kube_apiserver_request_duration_seconds | Histogram | Latency of the /healthz probe (buckets: 0.01–10s) |
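
To make the semantics of the two metrics concrete, here is a minimal, stdlib-only Go sketch of such a probe. It is not the PR's code: `probeResult`, `probeHealthz`, `newStatusServer`, and `probeLocal` are hypothetical names, and a plain struct stands in for the Prometheus gauge and histogram. It assumes that any non-200 status or transport error counts as unavailable and that latency is measured for every attempt.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeResult is a hypothetical stand-in for the two metrics: Available
// mirrors the gauge value, Duration is what the histogram would observe.
type probeResult struct {
	Available float64
	Duration  time.Duration
}

// probeHealthz issues a GET against url and reports availability and latency.
// Any transport error or non-200 status leaves Available at 0.
func probeHealthz(ctx context.Context, client *http.Client, url string) probeResult {
	start := time.Now()
	res := probeResult{}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err == nil {
		resp, doErr := client.Do(req)
		if doErr == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				res.Available = 1
			}
		}
	}
	res.Duration = time.Since(start)
	return res
}

// newStatusServer starts a local test server that always answers with code.
func newStatusServer(code int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(code)
	}))
}

// probeLocal probes a throwaway local server that answers with code.
func probeLocal(code int) probeResult {
	srv := newStatusServer(code)
	defer srv.Close()
	client := &http.Client{Timeout: 2 * time.Second}
	return probeHealthz(context.Background(), client, srv.URL+"/healthz")
}

func main() {
	for _, code := range []int{http.StatusOK, http.StatusServiceUnavailable} {
		r := probeLocal(code)
		fmt.Printf("status=%d available=%v duration=%s\n", code, r.Available, r.Duration)
	}
}
```

In the real operator the gauge and histogram would be updated where this sketch fills in `probeResult`.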

Why

HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.

How

  • Metrics are registered with the controller-runtime metrics registry and automatically scraped by the existing PodMonitor for the CPO — no new monitoring infrastructure required
  • Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster
  • The existing HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect, not a replacement
  • Works across all endpoint topologies: private clusters, public with Route, public with LoadBalancer, shared ingress (ARO HCP / ROSA HCP)

Key files

  • control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go — metric definitions and registration
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — instrumented health check
  • control-plane-operator/main.go — metrics initialization

Testing

  • Unit tests verify gauge and histogram are correctly set for success (200), failure (503), and unreachable scenarios
  • E2e test validates metric presence on the CPO pod using GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
  • All existing tests pass (make test exits 0)

Jira

CNTRLPLANE-2775

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit

  • New Features

    • Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.
  • Tests

    • Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.
  • Chores

    • Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 19, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 19, 2026

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 19, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds Prometheus metrics for KAS health (availability and request duration), threads an optional KASHealthMetrics through KAS health checks, initializes metrics at startup, and adds unit and E2E tests and helpers to validate metrics exposure. Some test and helper additions are duplicated in the diff.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Controller instrumentation: control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go | Added HostedControlPlaneReconciler.KASHealthMetrics *kas.KASHealthMetrics; changed healthCheckKASEndpoint to accept m *kas.KASHealthMetrics; record request duration and set availability (guarded by nil check). |
| KAS metrics implementation: control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go | New metrics: KASAvailableMetricName, KASRequestDurationMetricName, KASRequestDurationBuckets; KASHealthMetrics struct with Available gauge and RequestDuration histogram; NewKASHealthMetrics() registering metrics. |
| Controller unit tests: control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go | Added TestHealthCheckKASEndpointMetrics with subtests (200, 503, unreachable, nil-metrics), helpers newTestKASHealthMetrics and parseHostPort. Note: test and helpers appear duplicated in the diff — inspect for unintended repeats. |
| KAS metrics unit tests: control-plane-operator/controllers/hostedcontrolplane/kas/metrics_test.go | New test validating metric registration, gauge initial state, gauge update, and histogram observation via a Prometheus registry. |
| Startup wiring: control-plane-operator/main.go | Instantiates kas.NewKASHealthMetrics() and injects it into HostedControlPlaneReconciler during startup/setup. |
| E2E helpers & integration: test/e2e/util/hypershift_framework.go, test/e2e/util/util.go | Adds ValidateCPOMetrics E2E helper and invokes it in after-phase; helper polls control-plane-operator metrics for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName. Note: ValidateCPOMetrics appears duplicated in the diff — verify and dedupe. |

Sequence Diagram(s)

sequenceDiagram
  participant Tests as Tests/E2E
  participant Reconciler as HostedControlPlaneReconciler
  participant KAS as KAS/API
  participant Metrics as Prometheus/Registry

  Tests->>Reconciler: trigger health check
  Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
  activate KAS
  KAS-->>Reconciler: HTTP response (200/503/timeout)
  deactivate KAS
  alt m != nil
    Reconciler->>Metrics: observe RequestDuration
    Reconciler->>Metrics: set Available = 1 or 0
  end
  Tests->>Metrics: query metrics endpoint to validate metrics present/values
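
The `alt m != nil` branch in the diagram can be illustrated with a small Go sketch. This is an assumption-laden stand-in, not the PR's implementation: `healthRecorder`, `record`, and `healthCheck` are hypothetical names, the nil guard sits inside the method rather than at the call site, and the duration is assumed to be recorded for failed probes as well as successful ones.

```go
package main

import "fmt"

// healthRecorder is a hypothetical stand-in for *kas.KASHealthMetrics.
type healthRecorder struct {
	available    float64   // mirrors the gauge: 1 or 0
	observations []float64 // mirrors histogram observations, in seconds
}

// record stores one probe outcome. Calling it on a nil receiver is a no-op,
// so callers that pass no metrics keep working unchanged.
func (r *healthRecorder) record(ok bool, seconds float64) {
	if r == nil {
		return
	}
	r.available = 0
	if ok {
		r.available = 1
	}
	r.observations = append(r.observations, seconds)
}

// healthCheck simulates the outcome of a probe that returned statusCode after
// the given number of seconds. Metrics are a side effect; the returned error
// is what the existing condition logic would act on.
func healthCheck(statusCode int, seconds float64, m *healthRecorder) error {
	ok := statusCode == 200
	m.record(ok, seconds)
	if !ok {
		return fmt.Errorf("unexpected status %d", statusCode)
	}
	return nil
}

func main() {
	// A nil recorder must not panic.
	fmt.Println(healthCheck(200, 0.01, nil))

	m := &healthRecorder{}
	fmt.Println(healthCheck(503, 0.20, m), m.available, len(m.observations))
	fmt.Println(healthCheck(200, 0.05, m), m.available, len(m.observations))
}
```

Keeping the guard in one place means the health-check result is identical with or without metrics, which matches the stated goal of leaving the HostedControlPlaneAvailable logic unchanged.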

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | PR contains duplicated test implementations in hostedcontrolplane_controller_test.go and util.go, lacks meaningful assertion messages, and has insufficient timeout safeguards in E2E polling operations. | Remove duplicate test functions, add descriptive failure messages to all assertions, and ensure explicit timeout configuration for all polling operations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective: exposing KAS availability and latency metrics from the control-plane-operator, which aligns with the core implementation across all modified files. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR use stable, deterministic naming with no dynamic content, formatted strings, or variable substitution. |

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Feb 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@enxebre
Member Author

enxebre commented Feb 19, 2026

/cc @muraee @csrwng

@openshift-ci openshift-ci Bot requested review from csrwng and muraee February 19, 2026 12:36
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Feb 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Feb 19, 2026
@enxebre
Member Author

enxebre commented Feb 19, 2026

/auto-cc

@openshift-ci openshift-ci Bot requested review from jparrill and sjenning February 19, 2026 12:41
@typeid
Member

typeid commented Feb 19, 2026

Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures.

LGTM & thanks for the addition!

@typeid
Member

typeid commented Feb 19, 2026

Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :)

@muraee
Contributor

muraee commented Feb 19, 2026

lgtm

@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws

@enxebre enxebre marked this pull request as ready for review February 19, 2026 13:59
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
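
The name-based check suggested above can be sketched in stdlib-only Go. This is an illustration under stated assumptions, not the repository's helper: the real code would inspect whatever GetMetricsFromPod returns, whereas this sketch parses a made-up text-format payload directly. `samplePayload`, `sampleCounts`, and `hasFamilies` are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// samplePayload is made-up scrape output in the Prometheus text format.
const samplePayload = `# TYPE hypershift_kube_apiserver_available gauge
hypershift_kube_apiserver_available 1
# TYPE hypershift_kube_apiserver_request_duration_seconds histogram
hypershift_kube_apiserver_request_duration_seconds_bucket{le="0.01"} 0
hypershift_kube_apiserver_request_duration_seconds_bucket{le="+Inf"} 1
hypershift_kube_apiserver_request_duration_seconds_sum 0.02
hypershift_kube_apiserver_request_duration_seconds_count 1
`

// sampleCounts maps each metric family name to the number of sample lines
// found for it. Histogram series (_bucket, _sum, _count) count toward the
// family declared by an earlier "# TYPE <name> histogram" line.
func sampleCounts(payload string) map[string]int {
	types := map[string]string{}
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "#") {
			f := strings.Fields(line)
			if len(f) == 4 && f[1] == "TYPE" {
				types[f[2]] = f[3]
			}
			continue
		}
		// The sample name ends at the first label brace or space.
		name := line
		if i := strings.IndexAny(name, "{ "); i >= 0 {
			name = name[:i]
		}
		for _, suffix := range []string{"_bucket", "_sum", "_count"} {
			base := strings.TrimSuffix(name, suffix)
			if base != name && types[base] == "histogram" {
				name = base
				break
			}
		}
		counts[name]++
	}
	return counts
}

// hasFamilies reports whether every wanted family has at least one sample.
// No labels are inspected, so label-less metrics match.
func hasFamilies(payload string, wanted ...string) bool {
	counts := sampleCounts(payload)
	for _, name := range wanted {
		if counts[name] == 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hasFamilies(samplePayload,
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds"))
}
```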

@enxebre enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 51eb7d6 to e94ec9e Compare February 19, 2026 17:24
@enxebre
Member Author

enxebre commented Feb 19, 2026

/test e2e-aws
/verified by e2e

@muraee
Contributor

muraee commented May 28, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@enxebre
Member Author

enxebre commented May 28, 2026

/retest

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2059995860124569600 | Cost: $1.7596345 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@enxebre
Member Author

enxebre commented May 29, 2026

/retest

@enxebre
Member Author

enxebre commented May 29, 2026

/test e2e-aws

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
Add tests for the healthCheckKASEndpoint function that verify metrics
are correctly recorded during health check probes:

- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the fix-CNTRLPLANE-2775 branch from d4f8b15 to 4946853 Compare May 29, 2026 11:16
@bryan-cox
Member

/retest

@typeid
Member

typeid commented May 29, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

Add ValidateCPOMetrics function that verifies both
hypershift_kube_apiserver_available and
hypershift_kube_apiserver_request_duration_seconds metrics are present
on the control-plane-operator pod's metrics endpoint (port 8080).

The validation runs as an inline check in TestCreateCluster alongside
other Ensure* validations, rather than in the after() hook, to avoid
blocking all e2e tests if metrics emission is unstable.
It follows the established pattern using GetMetricsFromPod with
polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
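
The polling described in this commit message (fixed interval, overall timeout) can be sketched with the standard library alone. This is an assumption-labeled stand-in for wait.PollUntilContextTimeout, not the e2e helper itself: `pollUntil` and `succeedAfter` are hypothetical names, and the sketch checks the condition immediately before the first wait. The demo uses millisecond values so it finishes quickly; the commit message states 10s and 5min.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls cond right away and then once per interval until cond
// returns true, cond returns an error, or the timeout elapses.
func pollUntil(ctx context.Context, interval, timeout time.Duration, cond func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// succeedAfter polls every millisecond with a condition that turns true on
// the n-th call, and returns how many calls were made.
func succeedAfter(n int, timeoutMillis int) (int, error) {
	calls := 0
	err := pollUntil(context.Background(), time.Millisecond,
		time.Duration(timeoutMillis)*time.Millisecond,
		func(context.Context) (bool, error) {
			calls++
			return calls >= n, nil
		})
	return calls, err
}

func main() {
	calls, err := succeedAfter(3, 1000)
	fmt.Println("calls:", calls, "err:", err)

	// A condition that cannot be met in time ends with a deadline error.
	_, err = succeedAfter(1_000_000, 20)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```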
@enxebre enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 4946853 to db0c022 Compare June 2, 2026 08:13
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026
@devguyio
Contributor

devguyio commented Jun 2, 2026

/lgtm yolo

@enxebre
Member Author

enxebre commented Jun 2, 2026

/test e2e-aws

@enxebre
Member Author

enxebre commented Jun 2, 2026

/test e2e-aws-4-22

@muraee
Contributor

muraee commented Jun 2, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2061747161749524480 | Cost: $1.8019642499999997 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented Jun 2, 2026

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/unit | e94ec9e | link | true | /test unit |
| ci/prow/verify-workflows | e94ec9e | link | true | /test verify-workflows |
| ci/prow/e2e-aks-4-22 | 4946853 | link | true | /test e2e-aks-4-22 |
| ci/prow/e2e-aws-4-22 | db0c022 | link | true | /test e2e-aws-4-22 |
| ci/prow/e2e-aws | db0c022 | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 2, 2026

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

hypershift_framework.go:518: failed to create cluster, tearing down: failed to create infra: 
cannot create VPC S3 endpoint: operation error EC2: CreateVpcEndpoint, exceeded maximum number 
of attempts, 11, https response error StatusCode: 503, RequestID: ff01d342-633f-4b11-a9bf-cc9bb6bfb381, 
api error RequestLimitExceeded: Request limit exceeded. Account 820196288204 has been throttled 
on ec2:CreateVpcEndpoint because it exceeded its request rate limit.

Summary

The TestKarpenterUpgradeControlPlane test failed due to AWS EC2 API rate limiting — the shared CI AWS account (820196288204) was throttled on ec2:CreateVpcEndpoint calls, preventing the test from provisioning its HostedCluster infrastructure. This is an infrastructure-level flake completely unrelated to the PR's changes (which add KAS availability/latency Prometheus metrics to the control-plane-operator). The test suite ran 20 tests in parallel, all creating VPC infrastructure simultaneously, and this particular test hit the AWS rate limit ceiling. The test then spent ~2 hours in a blocked Teardown phase trying to clean up resources that were never fully created, eventually being killed by the 2-hour pod timeout with exit code 127 (triggered by a panic in the post-test alertSLOs function when the context was already canceled).

Root Cause

AWS EC2 API Rate Limiting (Infrastructure Flake — Unrelated to PR)

The e2e test suite runs 20 tests in parallel (-test.parallel=20), each creating its own HostedCluster with full AWS infrastructure (VPCs, subnets, endpoints, IAM roles, etc.). The TestKarpenterUpgradeControlPlane test failed at cluster creation time because the shared CI AWS account was throttled by AWS on the ec2:CreateVpcEndpoint API call.

The failure sequence was:

  1. 10:24:24 UTC — Test suite starts, 20 tests begin creating clusters simultaneously
  2. ~10:24:46 UTC — TestKarpenterUpgradeControlPlane attempts to create VPC S3 endpoint during infra setup
  3. AWS returns HTTP 503 with RequestLimitExceeded — the account has been throttled on ec2:CreateVpcEndpoint
  4. The SDK retried 11 times (max attempts) and gave up
  5. The test enters Teardown, trying to destroy a cluster that never fully existed (the HostedCluster resource karpenter-upgrade-control-plane-ck879 was never created)
  6. Teardown gets stuck for ~7126 seconds (nearly 2 hours) waiting for finalization of a cluster that doesn't exist
  7. 12:24:22 UTC — The 2-hour pod timeout fires, sending a shutdown signal
  8. Context cancellation cascades: IAM destroy fails, namespace deletion fails, rate limiter rejects new requests
  9. TestMain panics in alertSLOs → NewPrometheusClient when the Prometheus service lookup fails due to canceled context
  10. The panic causes exit code 127, and the CI step script retries the test (second run shows the same cached error output)

The PR's changes (adding KAS metrics in control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go and related files) modify the control-plane-operator, not the e2e test infrastructure, VPC creation, or Karpenter test logic. The only test files changed are create_cluster_test.go and util/util.go, which add a ValidateMetricsAreExposed validation — and that validation passed in all tests that successfully created clusters.

Recommendations
  1. Rerun the job — This is a transient AWS rate limiting flake. A retry should succeed if the CI account isn't under heavy concurrent load.
  2. No code changes needed — The failure is in TestKarpenterUpgradeControlPlane infrastructure setup, which is completely independent of the PR's KAS metrics changes.
  3. For CI infrastructure teams — The 20-parallel test setup creates significant burst load on AWS API endpoints. Consider:
    • Adding exponential backoff with jitter for VPC endpoint creation retries
    • Staggering cluster creation across tests instead of launching all 20 simultaneously
    • Requesting higher API rate limits for the CI AWS account on ec2:CreateVpcEndpoint
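
The first suggestion, exponential backoff with jitter, can be sketched as follows. This is a generic illustration, not AWS SDK or HyperShift code: `backoffCeiling` and `jitteredDelay` are hypothetical names, the 100ms base and 5s cap are made-up values, and the "full jitter" variant (a uniform delay between zero and the exponential ceiling) is one choice among several.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffCeiling returns min(maxDelay, base * 2^attempt) for attempt >= 0,
// doubling step by step so the value cannot overflow.
func backoffCeiling(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= maxDelay/2 {
			return maxDelay
		}
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// jitteredDelay picks a uniformly random delay in [0, ceiling] so that
// parallel callers do not retry in lockstep.
func jitteredDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	ceiling := backoffCeiling(attempt, base, maxDelay)
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	base, maxDelay := 100*time.Millisecond, 5*time.Second
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt=%d ceiling=%s delay=%s\n",
			attempt,
			backoffCeiling(attempt, base, maxDelay),
			jitteredDelay(attempt, base, maxDelay))
	}
}
```

The randomization matters most in the scenario described here: 20 parallel tests that all fail at the same moment would otherwise all retry at the same moment too.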

Evidence

| Evidence | Detail |
|---|---|
| Failed test | TestKarpenterUpgradeControlPlane (and its Teardown subtest) |
| Error type | AWS EC2 API rate limiting (RequestLimitExceeded) |
| AWS API | ec2:CreateVpcEndpoint — HTTP 503 |
| AWS Account | 820196288204 (shared CI account) |
| RequestID | ff01d342-633f-4b11-a9bf-cc9bb6bfb381 |
| Retry attempts | 11 (maximum reached) |
| Test duration | 7198.54s (~2 hours — mostly stuck in Teardown) |
| Teardown duration | 7126.39s (~1h58m waiting for non-existent cluster) |
| Exit code | 127 (panic in TestMain → alertSLOs → NewPrometheusClient at util.go:1574) |
| Panic cause | Context canceled during shutdown — prometheus-k8s service lookup failed |
| Test parallelism | 20 tests simultaneously creating AWS infrastructure |
| Total tests | 594 run, 25 skipped, 2 failures (both same test) |
| Pass rate | 99.7% — all other tests passed including tests exercising PR changes |
| PR relevance | None — PR changes KAS metrics in CPO, failure is in Karpenter VPC infra setup |
| PR's ValidateMetricsAreExposed | Passed in all clusters that were successfully created (TestAutoscaling, TestNodePool, etc.) |

Labels

  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
