CNTRLPLANE-2775: Expose KAS availability and latency metrics from the control-plane-operator #7749
enxebre wants to merge 3 commits into
Conversation
|
@enxebre: This pull request references CNTRLPLANE-2775 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.
If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉
Walkthrough
Adds Prometheus metrics for KAS health (availability and request duration), threads an optional
Changes
Sequence Diagram(s)
sequenceDiagram
participant Tests as Tests/E2E
participant Reconciler as HostedControlPlaneReconciler
participant KAS as KAS/API
participant Metrics as Prometheus/Registry
Tests->>Reconciler: trigger health check
Reconciler->>KAS: HTTP request to ingress (healthCheckKASEndpoint with m)
activate KAS
KAS-->>Reconciler: HTTP response (200/503/timeout)
deactivate KAS
alt m != nil
Reconciler->>Metrics: observe RequestDuration
Reconciler->>Metrics: set Available = 1 or 0
end
Tests->>Metrics: query metrics endpoint to validate metrics present/values
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/auto-cc |
|
Just to clarify a bit further, while these metrics are great, in ROSA HCP we intend to probe KAS availability externally via RHOBS synthetic monitoring (SREP-333), where Blackbox Exporter runs on RHOBS cells outside the management cluster. This gives us the advantage of testing the actual customer-facing network path (only partially for private API), including DNS resolution, load balancer health, and regional routing, rather than probing from within the MC's own network. I understand ARO HCP wants to avoid the RMO dependency, and these metrics help with that. However, since the CPO probe originates from within the MC, it's not a full replacement for external synthetic monitoring for SLA purposes IMO. That said, these CPO-local metrics are definitely useful even for the ROSA side for faster internal detection of control plane issues, for example catching KAS pod crashes or pinpointing network issues to in-cluster networking failures. LGTM & thanks for the addition! |
|
Also cc @dustman9000 as an FYI that this exists and is now being extended with latency as well :) |
|
lgtm |
|
/test e2e-aws |
|
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/util/util.go`:
- Around line 2712-2743: The loop in ValidateCPOMetrics uses
ValidateMetricPresence which expects labeled metrics and therefore never matches
label-less KAS metrics; modify the inner check after GetMetricsFromPod to look
for the MetricFamily by name (from the returned mf MetricFamily map or slice)
for kas.KASAvailableMetricName and kas.KASRequestDurationMetricName instead of
calling ValidateMetricPresence, i.e., verify the MetricFamily exists and has at
least one metric (no label checks) before returning true; keep the surrounding
wait.PollUntilContextTimeout, GetMetricsFromPod, and error handling unchanged.
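The check suggested above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: `metric`, `metricFamily`, and `familyPresent` are hypothetical, simplified stand-ins for the Prometheus client model types and for whatever helper the e2e utilities end up using. The point is that presence is decided by family name and sample count alone, so label-less metrics match.

```go
package main

import "fmt"

// metric and metricFamily are simplified, hypothetical stand-ins for the
// Prometheus client model types; only what the check needs is modelled.
type metric struct {
	Labels map[string]string
	Value  float64
}

type metricFamily struct {
	Name    string
	Metrics []metric
}

// familyPresent reports whether a family with the given name exists and
// carries at least one sample. Labels are deliberately not inspected, so
// label-less metrics match too.
func familyPresent(families map[string]*metricFamily, name string) bool {
	mf, ok := families[name]
	return ok && mf != nil && len(mf.Metrics) > 0
}

func main() {
	// Example input: only the availability gauge has been scraped so far.
	families := map[string]*metricFamily{
		"hypershift_kube_apiserver_available": {
			Name:    "hypershift_kube_apiserver_available",
			Metrics: []metric{{Value: 1}}, // no labels
		},
	}
	for _, name := range []string{
		"hypershift_kube_apiserver_available",
		"hypershift_kube_apiserver_request_duration_seconds",
	} {
		fmt.Printf("%s present: %v\n", name, familyPresent(families, name))
	}
}
```

In the real test this lookup would presumably run inside the existing polling loop, returning true only once both families are present.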
|
51eb7d6 to e94ec9e
|
/test e2e-aws |
|
/lgtm |
|
Scheduling tests matching the |
|
/retest |
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest |
|
/test e2e-aws |
Add tests for the healthCheckKASEndpoint function that verify metrics are correctly recorded during health check probes:
- Gauge set to 1 and histogram observed on successful 200 response
- Gauge set to 0 on non-200 response (503)
- Gauge set to 0 on connection error (unreachable endpoint)
- No panic when metrics is nil (backward compatibility)

Also add a basic test for KASHealthMetrics construction and registration.

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
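The four cases listed in this commit message can be exercised with a stdlib-only sketch like the one below. It is not the PR's implementation: `healthRecorder`, `probe`, `probeStatus`, and `probeUnreachable` are hypothetical names, the recorder is a plain struct standing in for the Prometheus gauge and histogram, and recording a duration on failed probes is an assumption (the commit message only states that the histogram is observed on a 200 response).

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthRecorder is a hypothetical stand-in for the PR's KASHealthMetrics.
type healthRecorder struct {
	Available float64         // gauge: 1 when the probe succeeded, 0 otherwise
	Durations []time.Duration // histogram stand-in: one entry per probe
}

// get performs the HTTP request and treats any non-200 status as an error.
func get(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// probe runs one health check and records the outcome only when m is not
// nil, so callers that do not care about metrics can pass nil safely.
// Assumption: the duration is recorded for failed probes as well.
func probe(ctx context.Context, client *http.Client, url string, m *healthRecorder) error {
	start := time.Now()
	err := get(ctx, client, url)
	if m != nil {
		m.Durations = append(m.Durations, time.Since(start))
		m.Available = 0
		if err == nil {
			m.Available = 1
		}
	}
	return err
}

// probeStatus probes a throwaway server that always answers with status.
func probeStatus(status int, m *healthRecorder) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()
	return probe(context.Background(), srv.Client(), srv.URL+"/healthz", m)
}

// probeUnreachable probes a server that has already been shut down.
func probeUnreachable(m *healthRecorder) error {
	srv := httptest.NewServer(http.NotFoundHandler())
	url := srv.URL + "/healthz"
	srv.Close()
	client := &http.Client{Timeout: 2 * time.Second}
	return probe(context.Background(), client, url, m)
}

func main() {
	for _, status := range []int{http.StatusOK, http.StatusServiceUnavailable} {
		m := &healthRecorder{}
		err := probeStatus(status, m)
		fmt.Printf("status=%d available=%v observations=%d err=%v\n",
			status, m.Available, len(m.Durations), err)
	}
	m := &healthRecorder{Available: 1}
	err := probeUnreachable(m)
	fmt.Printf("unreachable: available=%v failed=%v\n", m.Available, err != nil)
	fmt.Println("nil recorder:", probeStatus(http.StatusOK, nil))
}
```

Keeping the nil check in one place (`probe`) is what makes the "no panic when metrics is nil" case cheap to guarantee.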
d4f8b15 to 4946853
|
/retest |
|
/lgtm |
|
Scheduling tests matching the |
Add ValidateCPOMetrics function that verifies both hypershift_kube_apiserver_available and hypershift_kube_apiserver_request_duration_seconds metrics are present on the control-plane-operator pod's metrics endpoint (port 8080).

The validation runs as an inline check in TestCreateCluster alongside other Ensure* validations, rather than in the after() hook, to avoid blocking all e2e tests if metrics emission is unstable. It follows the established pattern using GetMetricsFromPod with polling (10s interval, 5min timeout).

Ref: CNTRLPLANE-2775
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
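The polling behaviour described in this commit message can be illustrated with a small stdlib-only loop. The real test uses `wait.PollUntilContextTimeout` (per the review comment earlier in the thread); `pollUntil` and `waitForMetrics` below are hypothetical stand-ins, and the intervals are shortened from the 10s/5min in the commit message so the sketch finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls check immediately and then once per interval until check
// returns true, check returns an error, or the timeout elapses.
func pollUntil(ctx context.Context, interval, timeout time.Duration,
	check func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := check(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // timeout or parent cancellation
		case <-ticker.C:
		}
	}
}

// waitForMetrics simulates an endpoint whose metrics only show up on the
// readyAfter-th scrape and returns how many scrapes were performed.
func waitForMetrics(readyAfter int, interval, timeout time.Duration) (int, error) {
	scrapes := 0
	err := pollUntil(context.Background(), interval, timeout,
		func(context.Context) (bool, error) {
			scrapes++
			return scrapes >= readyAfter, nil
		})
	return scrapes, err
}

func main() {
	// Metrics appear on the third scrape, well inside the timeout.
	n, err := waitForMetrics(3, 10*time.Millisecond, time.Second)
	fmt.Printf("scrapes=%d err=%v\n", n, err)

	// Metrics never appear before the timeout.
	_, err = waitForMetrics(1000, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

A check that returns `(false, nil)` on a transient scrape error keeps polling, which is one way to stay tolerant of a metrics endpoint that is not ready yet.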
4946853 to db0c022
|
/lgtm yolo |
|
/test e2e-aws |
|
/test e2e-aws-4-22 |
|
/lgtm |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
@enxebre: The following tests failed, say
|
I have all the evidence I need. Let me produce the final report.
Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error
Summary
The
Root Cause
AWS EC2 API Rate Limiting (Infrastructure Flake, Unrelated to PR)
The e2e test suite runs 20 tests in parallel (
The failure sequence was:
The PR's changes (adding KAS metrics in
Recommendations
Evidence
|
Description
Instruments the existing `healthCheckKASEndpoint()` function in the control-plane-operator to expose two new Prometheus metrics for KAS health monitoring:
- `hypershift_kube_apiserver_available` (gauge): 1 when `/healthz` returns HTTP 200, 0 otherwise
- `hypershift_kube_apiserver_request_duration_seconds` (histogram): duration of the `/healthz` probe (buckets: 0.01–10s)
Why
HCP offerings (ROSA HCP, ARO HCP) need to monitor customer API endpoint availability and latency for SLA purposes. ROSA HCP currently relies on an external tool (route-monitor-operator) solely for this. These native metrics eliminate that dependency.
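For readers less familiar with Prometheus histograms, the sketch below shows how a probe duration lands in cumulative buckets. It is illustrative only: the `histogram` type is a hand-rolled stand-in, not the Prometheus client, and the bucket bounds are assumed values inside the 0.01–10s range mentioned above, since the description does not list the exact boundaries.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a minimal cumulative-bucket histogram in the Prometheus style.
type histogram struct {
	bounds []float64 // bucket upper bounds in seconds, ascending
	counts []uint64  // counts[i] = number of observations <= bounds[i]
	count  uint64    // total observations (the implicit +Inf bucket)
	sum    float64   // sum of all observed values
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{bounds: b, counts: make([]uint64, len(b))}
}

// observe records one duration. Buckets are cumulative, so every bucket
// whose upper bound is >= the value is incremented.
func (h *histogram) observe(seconds float64) {
	h.count++
	h.sum += seconds
	// Index of the first bound that is >= seconds.
	for i := sort.SearchFloat64s(h.bounds, seconds); i < len(h.bounds); i++ {
		h.counts[i]++
	}
}

func main() {
	// Hypothetical bounds: the description only states the 0.01–10s range.
	h := newHistogram([]float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10})
	for _, d := range []float64{0.004, 0.2, 12} {
		h.observe(d)
	}
	for i, b := range h.bounds {
		fmt.Printf("le=%g count=%d\n", b, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d sum=%g\n", h.count, h.sum)
}
```

A probe slower than the largest bound only shows up in the total count, which is why the upper end of the bucket range should sit near the probe's timeout.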
How
- `HostedControlPlaneAvailable` condition logic is unchanged: metrics are a side-effect, not a replacement
Key files
- `control-plane-operator/controllers/hostedcontrolplane/kas/metrics.go`: metric definitions and registration
- `control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`: instrumented health check
- `control-plane-operator/main.go`: metrics initialization
Testing
- `GetMetricsFromPod`/`ValidateMetricPresence` (Karpenter pattern)
- `make test` exits 0
Jira
CNTRLPLANE-2775
🤖 Generated with Claude Code via /jira:solve [CNTRLPLANE-2775](https://issues.redhat.com/browse/CNTRLPLANE-2775)
Summary by CodeRabbit
New Features
Tests
Chores