CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator by enxebre · Pull Request #7749 · openshift/hypershift

enxebre · 2026-02-19T12:23:39Z

Description

Instruments the existing healthCheckKASEndpoint() function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:

Metric	Type	Description
`hypershift_kube_apiserver_available`	Gauge	1 if `/healthz` returns HTTP 200, 0 otherwise
`hypershift_kube_apiserver_request_duration_seconds`	Histogram	Latency of the `/healthz` probe (buckets: 0.01–10s)
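
To make the histogram row concrete, here is a minimal stdlib-only sketch of how one probe latency is counted into cumulative buckets, the way a Prometheus histogram reports it. The bucket boundaries are an assumption chosen to span the stated 0.01–10s range (the PR's actual `KASRequestDurationBuckets` values are not shown in this conversation), and `histogram` is a hand-rolled stand-in, not the Prometheus client type the PR uses.

```go
package main

import "sort"

// assumedBuckets spans the 0.01s-10s range named in the PR description.
// The exact boundaries are an assumption, not the PR's values.
var assumedBuckets = []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10}

// histogram is a hypothetical stand-in for a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted ascending
	counts      []uint64  // per-bucket counts; the last slot is +Inf
	sum         float64
	total       uint64
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{upperBounds: b, counts: make([]uint64, len(b)+1)}
}

// observe records one probe latency in seconds.
func (h *histogram) observe(seconds float64) {
	// Index of the first bucket whose upper bound is >= the observation;
	// len(upperBounds) selects the +Inf slot.
	i := sort.SearchFloat64s(h.upperBounds, seconds)
	h.counts[i]++
	h.sum += seconds
	h.total++
}

// cumulative returns how many observations were <= upperBounds[i], which is
// what the exported _bucket{le="..."} series would report.
func (h *histogram) cumulative(i int) uint64 {
	var c uint64
	for j := 0; j <= i; j++ {
		c += h.counts[j]
	}
	return c
}
```

With these assumed boundaries, a 0.2s probe is counted in the `le="0.5"` bucket and every larger one, while a probe slower than 10s only shows up in `+Inf` and the total count.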

Why

HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.

How

Metrics are registered with the controller-runtime metrics registry and automatically scraped by the existing PodMonitor for the CPO — no new monitoring infrastructure required
Each CPO pod runs in its own HCP namespace, so metrics are naturally scoped per hosted cluster
The existing HostedControlPlaneAvailable condition logic is unchanged — metrics are a side-effect, not a replacement
Works across all endpoint topologies: private clusters, public with Route, public with LoadBalancer, shared ingress (ARO HCP / ROSA HCP)

Key files

control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go — metric definitions and registration
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — instrumented health check
control-plane-operator/main.go — metrics initialization

Testing

Unit tests verify gauge and histogram are correctly set for success (200), failure (503), and unreachable scenarios
E2e test validates metric presence on the CPO pod using GetMetricsFromPod/ValidateMetricPresence (Karpenter pattern)
All existing tests pass (make test exits 0)

Jira

CNTRLPLANE-2775

🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)

Summary by CodeRabbit

New Features
- Added KAS health metrics (availability gauge and request-duration histogram) and registered them for collection.
Tests
- Added unit tests for KAS health metrics and end-to-end validation to check metrics are emitted during runs.
Chores
- Metrics instrumentation initialized at startup so controller exposes KAS health metrics for monitoring.

openshift-ci-robot · 2026-02-19T12:23:41Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-02-19T12:23:43Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

coderabbitai · 2026-02-19T12:23:49Z

No actionable comments were generated in the recent review. 🎉

Walkthrough

Adds Prometheus metrics for KAS health (availability and request duration), threads an optional KASHealthMetrics through KAS health checks, initializes metrics at startup, and adds unit and E2E tests and helpers to validate metrics exposure. Some test and helper additions are duplicated in the diff.

Changes

Cohort / File(s)	Summary
Controller instrumentation `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`	Added `HostedControlPlaneReconciler.KASHealthMetrics kas.KASHealthMetrics`; changed `healthCheckKASEndpoint` to accept `m kas.KASHealthMetrics`; record request duration and set availability (guarded by nil check).
KAS metrics implementation `control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go`	New metrics: `KASAvailableMetricName`, `KASRequestDurationMetricName`, `KASRequestDurationBuckets`; `KASHealthMetrics` struct with `Available` gauge and `RequestDuration` histogram; `NewKASHealthMetrics()` registering metrics.
Controller unit tests `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`	Added `TestHealthCheckKASEndpointMetrics` with subtests (200, 503, unreachable, nil-metrics), helpers `newTestKASHealthMetrics` and `parseHostPort`. Note: test and helpers appear duplicated in the diff — inspect for unintended repeats.
KAS metrics unit tests `control-plane-operator/controllers/hostedcontrolplane/kas/metrics_test.go`	New test validating metric registration, gauge initial state, gauge update, and histogram observation via a Prometheus registry.
Startup wiring `control-plane-operator/main.go`	Instantiates `kas.NewKASHealthMetrics()` and injects it into `HostedControlPlaneReconciler` during startup/setup.
E2E helpers & integration `test/e2e/util/hypershift_framework.go`, `test/e2e/util/util.go`	Adds `ValidateCPOMetrics` E2E helper and invokes it in after-phase; helper polls control-plane-operator metrics for `kas.KASAvailableMetricName` and `kas.KASRequestDurationMetricName`. Note: `ValidateCPOMetrics` appears duplicated in the diff — verify and dedupe.

Sequence Diagram(s)

sequenceDiagram
  participant Tests as Tests/E2E
  participant Reconciler as HostedControlPlaneReconciler
  participant KAS as KAS/API
  participant Metrics as Prometheus/Registry

  Tests->>Reconciler: trigger health check
  Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
  activate KAS
  KAS-->>Reconciler: HTTP response (200/503/timeout)
  deactivate KAS
  alt m != nil
    Reconciler->>Metrics: observe RequestDuration
    Reconciler->>Metrics: set Available = 1 or 0
  end
  Tests->>Metrics: query metrics endpoint to validate metrics present/values

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	PR contains duplicated test implementations in hostedcontrolplane_controller_test.go and util.go, lacks meaningful assertion messages, and has insufficient timeout safeguards in E2E polling operations.	Remove duplicate test functions, add descriptive failure messages to all assertions, and ensure explicit timeout configuration for all polling operations.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main objective: exposing KAS availability and latency metrics from the control-plane-operator, which aligns with the core implementation across all modified files.
Stable And Deterministic Test Names	✅ Passed	All test names in the PR use stable, deterministic naming with no dynamic content, formatted strings, or variable substitution.

openshift-ci-robot · 2026-02-19T12:24:59Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

openshift-ci · 2026-02-19T12:25:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

enxebre · 2026-02-19T12:35:56Z

/cc @muraee @csrwng

openshift-ci · 2026-02-19T12:37:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre · 2026-02-19T12:40:52Z

/auto-cc

typeid · 2026-02-19T13:34:37Z

Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network.

I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO.

That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures.

LGTM & thanks for the addition!

typeid · 2026-02-19T13:35:19Z

Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :)

muraee · 2026-02-19T13:56:59Z

lgtm

enxebre · 2026-02-19T13:59:24Z

/test e2e-aws

openshift-ci-robot · 2026-02-19T14:10:04Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Summary by CodeRabbit

Release Notes

New Features

Added Kubernetes API Server (KAS) health metrics monitoring with Prometheus instrumentation, tracking request duration and availability status.

Tests

Added comprehensive validation tests for KAS health metrics functionality.

Integrated Control Plane Operator metrics validation into end-to-end test workflows.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
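
The label-free check suggested above can be sketched against the raw Prometheus text exposition using only the standard library. This is an illustration, not the PR's fix: `hasMetricFamily` is a hypothetical helper, and the real test would inspect the parsed MetricFamily data returned by `GetMetricsFromPod` rather than scanning text.

```go
package main

import (
	"bufio"
	"strings"
)

// hasMetricFamily reports whether a Prometheus text exposition contains at
// least one sample for the named family, ignoring labels entirely.
// Histogram families are matched through their _bucket, _sum and _count series.
func hasMetricFamily(exposition, family string) bool {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The sample name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		switch name {
		case family, family + "_bucket", family + "_sum", family + "_count":
			return true
		}
	}
	return false
}
```

Because only the sample name is compared, a gauge exported without labels matches just as well as a labeled series, which is the property the review comment asks for.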

openshift-ci-robot · 2026-02-19T17:19:00Z

@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

Summary by CodeRabbit

New Features

Added KAS health metrics: availability and request-duration metrics exposed for monitoring.

Tests

Added unit tests for KAS health metrics and integrated control-plane metrics validation into end-to-end test flows.

Chores

Instrumentation initialized at startup so metrics are available from the controller runtime.

enxebre · 2026-02-19T17:25:07Z

/test e2e-aws
/verified by e2e

muraee · 2026-05-28T13:49:11Z

/lgtm

openshift-merge-bot · 2026-05-28T13:50:21Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

enxebre · 2026-05-28T14:03:22Z

/retest

hypershift-jira-solve-ci · 2026-05-28T16:28:03Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2059995860124569600 | Cost: $1.7596345 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

_Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6_

enxebre · 2026-05-29T09:41:06Z

/retest

enxebre · 2026-05-29T10:38:43Z

/test e2e-aws

Add tests for the healthCheckKASEndpoint function that verify metrics are correctly recorded during health check probes:
- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bryan-cox · 2026-05-29T17:07:55Z

/retest

typeid · 2026-05-29T17:11:27Z

/lgtm

openshift-merge-bot · 2026-05-29T17:11:45Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

Add ValidateCPOMetrics function that verifies both hypershift_kube_apiserver_available and hypershift_kube_apiserver_request_duration_seconds metrics are present on the control-plane-operator pod's metrics endpoint (port 8080).

The validation runs as an inline check in TestCreateCluster alongside other Ensure* validations, rather than in the after() hook, to avoid blocking all e2e tests if metrics emission is unstable. It follows the established pattern using GetMetricsFromPod with polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

devguyio · 2026-06-02T08:15:14Z

/lgtm yolo

enxebre · 2026-06-02T09:49:56Z

/test e2e-aws

enxebre · 2026-06-02T09:50:09Z

/test e2e-aws-4-22

muraee · 2026-06-02T09:56:43Z

/lgtm

openshift-merge-bot · 2026-06-02T09:57:02Z

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

hypershift-jira-solve-ci · 2026-06-02T12:38:29Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2061747161749524480 | Cost: $1.8019642499999997 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report

_Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6_

openshift-ci · 2026-06-02T12:48:43Z

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/unit	`e94ec9e`	link	true	`/test unit`
ci/prow/verify-workflows	`e94ec9e`	link	true	`/test verify-workflows`
ci/prow/e2e-aks-4-22	`4946853`	link	true	`/test e2e-aks-4-22`
ci/prow/e2e-aws-4-22	`db0c022`	link	true	`/test e2e-aws-4-22`
ci/prow/e2e-aws	`db0c022`	link	true	`/test e2e-aws`

hypershift-jira-solve-ci · 2026-06-02T12:52:47Z

I have all the evidence I need. Let me produce the final report.

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2061747161749524480
Target: e2e-aws
PR: #7749 (CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator)
Result: 594 tests run, 25 skipped, 2 failures (both in TestKarpenterUpgradeControlPlane)

Test Failure Analysis

Error

hypershift_framework.go:518: failed to create cluster, tearing down: failed to create infra: 
cannot create VPC S3 endpoint: operation error EC2: CreateVpcEndpoint, exceeded maximum number 
of attempts, 11, https response error StatusCode: 503, RequestID: ff01d342-633f-4b11-a9bf-cc9bb6bfb381, 
api error RequestLimitExceeded: Request limit exceeded. Account 820196288204 has been throttled 
on ec2:CreateVpcEndpoint because it exceeded its request rate limit.

Summary

The TestKarpenterUpgradeControlPlane test failed due to AWS EC2 API rate limiting — the shared CI AWS account (820196288204) was throttled on ec2:CreateVpcEndpoint calls, preventing the test from provisioning its HostedCluster infrastructure. This is an infrastructure-level flake completely unrelated to the PR's changes (which add KAS availability/latency Prometheus metrics to the control-plane-operator). The test suite ran 20 tests in parallel, all creating VPC infrastructure simultaneously, and this particular test hit the AWS rate limit ceiling. The test then spent ~2 hours in a blocked Teardown phase trying to clean up resources that were never fully created, eventually being killed by the 2-hour pod timeout with exit code 127 (triggered by a panic in the post-test alertSLOs function when the context was already canceled).

Root Cause

AWS EC2 API Rate Limiting (Infrastructure Flake — Unrelated to PR)

The e2e test suite runs 20 tests in parallel (-test.parallel=20), each creating its own HostedCluster with full AWS infrastructure (VPCs, subnets, endpoints, IAM roles, etc.). The TestKarpenterUpgradeControlPlane test failed at cluster creation time because the shared CI AWS account was throttled by AWS on the ec2:CreateVpcEndpoint API call.

The failure sequence was:

1. 10:24:24 UTC: Test suite starts, 20 tests begin creating clusters simultaneously
2. ~10:24:46 UTC: TestKarpenterUpgradeControlPlane attempts to create VPC S3 endpoint during infra setup
3. AWS returns HTTP 503 with RequestLimitExceeded; the account has been throttled on ec2:CreateVpcEndpoint
4. The SDK retried 11 times (max attempts) and gave up
5. The test enters Teardown, trying to destroy a cluster that never fully existed (the HostedCluster resource karpenter-upgrade-control-plane-ck879 was never created)
6. Teardown gets stuck for ~7126 seconds (nearly 2 hours) waiting for finalization of a cluster that doesn't exist
7. 12:24:22 UTC: The 2-hour pod timeout fires, sending a shutdown signal
8. Context cancellation cascades: IAM destroy fails, namespace deletion fails, rate limiter rejects new requests
9. TestMain panics in alertSLOs → NewPrometheusClient when the Prometheus service lookup fails due to canceled context
10. The panic causes exit code 127, and the CI step script retries the test (second run shows the same cached error output)

The PR's changes (adding KAS metrics in control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go and related files) modify the control-plane-operator, not the e2e test infrastructure, VPC creation, or Karpenter test logic. The only test files changed are create_cluster_test.go and util/util.go, which add a ValidateMetricsAreExposed validation — and that validation passed in all tests that successfully created clusters.

Recommendations

Rerun the job — This is a transient AWS rate limiting flake. A retry should succeed if the CI account isn't under heavy concurrent load.
No code changes needed — The failure is in TestKarpenterUpgradeControlPlane infrastructure setup, which is completely independent of the PR's KAS metrics changes.
For CI infrastructure teams — The 20-parallel test setup creates significant burst load on AWS API endpoints. Consider:
- Adding exponential backoff with jitter for VPC endpoint creation retries
- Staggering cluster creation across tests instead of launching all 20 simultaneously
- Requesting higher API rate limits for the CI AWS account on ec2:CreateVpcEndpoint

Evidence

Evidence	Detail
Failed test	`TestKarpenterUpgradeControlPlane` (and its `Teardown` subtest)
Error type	AWS EC2 API rate limiting (`RequestLimitExceeded`)
AWS API	`ec2:CreateVpcEndpoint` — HTTP 503
AWS Account	`820196288204` (shared CI account)
RequestID	`ff01d342-633f-4b11-a9bf-cc9bb6bfb381`
Retry attempts	11 (maximum reached)
Test duration	7198.54s (~2 hours — mostly stuck in Teardown)
Teardown duration	7126.39s (~1h58m waiting for non-existent cluster)
Exit code	127 (panic in `TestMain` → `alertSLOs` → `NewPrometheusClient` at `util.go:1574`)
Panic cause	Context canceled during shutdown — `prometheus-k8s` service lookup failed
Test parallelism	20 tests simultaneously creating AWS infrastructure
Total tests	594 run, 25 skipped, 2 failures (both same test)
Pass rate	99.7% — all other tests passed including tests exercising PR changes
PR relevance	None — PR changes KAS metrics in CPO, failure is in Karpenter VPC infra setup
PR's ValidateMetricsAreExposed	Passed in all clusters that were successfully created (TestAutoscaling, TestNodePool, etc.)

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 19, 2026

openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Feb 19, 2026

openshift-ci Bot requested review from csrwng and muraee February 19, 2026 12:36

openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Feb 19, 2026

openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Feb 19, 2026

openshift-ci Bot requested review from jparrill and sjenning February 19, 2026 12:41

enxebre marked this pull request as ready for review February 19, 2026 13:59

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2026

coderabbitai Bot reviewed Feb 19, 2026

Comment thread test/e2e/util/util.go

enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 51eb7d6 to e94ec9e Compare February 19, 2026 17:24

openshift-ci Bot assigned muraee May 28, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026

This was referenced May 28, 2026

NO-JIRA: feat(skills): add validate-pr-override-images skill #8616

Merged

OCPBUGS-86661: Konnectivity retry proxy connection on timeout #8579

Open

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2026

enxebre force-pushed the fix-CNTRLPLANE-2775 branch from d4f8b15 to 4946853 Compare May 29, 2026 11:16

openshift-ci Bot assigned typeid May 29, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026

enxebre force-pushed the fix-CNTRLPLANE-2775 branch from 4946853 to db0c022 Compare June 2, 2026 08:13

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2026
