
OCPBUGS-92837: test/router: wait for all per-route metrics before asserting #31344

Open
mkowalski wants to merge 1 commit into openshift:main from mkowalski:fix/router-metrics-wait-for-all-stats

Conversation

@mkowalski

@mkowalski mkowalski commented Jun 26, 2026

Contributor

Summary

The HAProxy router metrics test exits its retry loop as soon as
haproxy_backend_connections_total reaches the expected count, but then
immediately asserts on other per-route metrics like
haproxy_server_http_responses_total. The HAProxy exporter has a scrape
interval (typically 5s), so these metrics may not be populated in the same
scrape that satisfied the connections check.

This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs
(Sippy regression #42639 /
OCPBUGS-92837) because
the post-loop assertions find nil where they expect populated gauges.

Root Cause

The retry loop at lines 164-186 waits for:

  1. haproxy_server_up to have 2 non-zero entries (both backend servers UP)
  2. haproxy_backend_connections_total >= 10 for the test route

Once satisfied, the loop exits. But the post-loop assertions immediately check
haproxy_server_http_responses_total with code=2xx — which may not yet
be populated because the HAProxy exporter scrapes stats on a 5-second interval.
The connections metric can appear in one exporter scrape cycle while the HTTP
responses metric only appears in the next one.
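The race described above can be reproduced with a minimal, stdlib-only sketch. Everything here is an assumption made for illustration: the `sample` and `scrape` types, the `find` helper (a simplified stand-in for the test's `findGaugesWithLabels`), the `route` label name, and its value are hypothetical, and the `haproxy_server_up` check is left out for brevity.

```go
package main

import "fmt"

// sample is a simplified stand-in for one labelled metric value.
type sample struct {
	labels map[string]string
	value  float64
}

// scrape maps a metric name to its samples, mimicking one exporter scrape.
type scrape map[string][]sample

// find returns the values of all samples whose labels contain every pair in
// want. It returns nil when the metric is absent or nothing matches.
func find(samples []sample, want map[string]string) []float64 {
	var out []float64
	for _, s := range samples {
		match := true
		for k, v := range want {
			if s.labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, s.value)
		}
	}
	return out
}

// oldExitCondition mirrors the pre-fix behaviour described in the PR:
// only the connection counter decides whether the loop may exit.
func oldExitCondition(m scrape, route map[string]string, times int) bool {
	conns := find(m["haproxy_backend_connections_total"], route)
	return len(conns) > 0 && conns[0] >= float64(times)
}

func main() {
	route := map[string]string{"route": "weightedroute"} // hypothetical label set

	// First scrape: connections are counted, responses are not exported yet.
	first := scrape{
		"haproxy_backend_connections_total": {{labels: route, value: 10}},
	}

	fmt.Println("loop exits:", oldExitCondition(first, route, 10))

	responses := find(first["haproxy_server_http_responses_total"],
		map[string]string{"route": "weightedroute", "code": "2xx"})
	fmt.Println("responses nil:", responses == nil)
}
```

With this input the old condition lets the loop exit while the responses lookup still returns a nil slice, which matches the shape of the reported failure.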

Fix

Add haproxy_server_http_responses_total with code=2xx to the loop exit
condition. The loop now only returns success when all per-route backend stats
are confirmed present in the same metrics scrape. This adds at most one extra
exporter scrape cycle (~5s) to the wait, well within the 240s timeout.
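A rough sketch of the tightened exit condition follows. It is not the code from `metrics.go`: the types, the `find` and `with2xx` helpers, the `routeStatsReady` name, and the `route` label are all hypothetical, and "populated" is assumed here to mean "at least one matching sample exists".

```go
package main

import "fmt"

// sample is a simplified stand-in for one labelled metric value.
type sample struct {
	labels map[string]string
	value  float64
}

// scrape maps a metric name to its samples, mimicking one exporter scrape.
type scrape map[string][]sample

// find returns the values of all samples whose labels contain every pair in want.
func find(samples []sample, want map[string]string) []float64 {
	var out []float64
	for _, s := range samples {
		match := true
		for k, v := range want {
			if s.labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, s.value)
		}
	}
	return out
}

// with2xx copies the route labels and adds code="2xx".
func with2xx(route map[string]string) map[string]string {
	out := map[string]string{"code": "2xx"}
	for k, v := range route {
		out[k] = v
	}
	return out
}

// routeStatsReady reports whether a single scrape contains both the expected
// connection count and at least one 2xx response sample for the route.
func routeStatsReady(m scrape, route map[string]string, times int) bool {
	conns := find(m["haproxy_backend_connections_total"], route)
	if len(conns) == 0 || conns[0] < float64(times) {
		return false
	}
	responses := find(m["haproxy_server_http_responses_total"], with2xx(route))
	return len(responses) > 0
}

func main() {
	route := map[string]string{"route": "weightedroute"} // hypothetical label set
	scrapes := []scrape{
		// Scrape 0: connections only, so the loop should keep retrying.
		{"haproxy_backend_connections_total": {{labels: route, value: 10}}},
		// Scrape 1: both metrics present, so the loop may exit.
		{
			"haproxy_backend_connections_total":   {{labels: route, value: 10}},
			"haproxy_server_http_responses_total": {{labels: with2xx(route), value: 10}},
		},
	}
	for i, m := range scrapes {
		fmt.Printf("scrape %d ready: %v\n", i, routeStatsReady(m, route, 10))
	}
}
```

Because both lookups run against the same scrape, the map handed to the post-loop assertions is the one that satisfied the condition, which is the property the fix relies on.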

Verification

  • Failure signature: metrics.go:227: Expected <[]float64 | len:0, cap:0>: nil
  • All 11/11 failing runs show this pattern
  • The fix ensures the metrics map used for post-loop assertions contains all
    required per-route stats before proceeding

🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

  • Tests
    • Improved the reliability of router metrics validation by waiting for all expected per-route statistics to appear before considering the check successful.
    • Added additional retry handling so metrics checks continue until connection and response data are fully available.

The HAProxy router metrics test exits its retry loop as soon as
haproxy_backend_connections_total reaches the expected count, but
then immediately asserts on other per-route metrics like
haproxy_server_http_responses_total. The HAProxy exporter has a
scrape interval (typically 5s), so these metrics may not be populated
in the same scrape that satisfied the connections check.

This causes a 100% failure rate on 5.0 Azure micro-upgrade jobs
(regression #42639 / OCPBUGS-92837) because the post-loop assertions
find nil where they expect populated gauges.

Fix by adding haproxy_server_http_responses_total 2xx to the loop
exit condition so we only proceed when all per-route backend stats
are confirmed present in the same metrics scrape.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 26, 2026


Walkthrough

The HAProxy router metrics test now keeps polling until route backend connection counts and per-server 2xx response metrics are both present in the same scrape. It logs a retry message while waiting instead of succeeding as soon as backend connections appear.

Changes

HAProxy route metrics polling

| Layer / File(s) | Summary |
| --- | --- |
| Poll for route metrics<br>test/extended/router/metrics.go | The poll loop now waits for haproxy_backend_connections_total and haproxy_server_http_responses_total{code=2xx} to be populated together before succeeding, and logs retries while either metric is still missing. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The modified file’s Ginkgo titles are static strings; the PR changes only test body polling logic, not any dynamic or unstable test names. |
| Test Structure And Quality | ✅ Passed | The changed test keeps a single integration behavior, uses bounded PollImmediate timeouts, and follows existing router-test setup/cleanup patterns. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo test was added; the existing router metrics spec is already skipped on MicroShift and the route test has an [apigroup:route.openshift.io] tag. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The modified router metrics test only waits for additional metrics; it has no multi-node/HA assumptions, and no node/topology checks were added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only router test polling logic changed; no manifests/controllers or topology-based scheduling constraints were introduced. |
| Ote Binary Stdout Contract | ✅ Passed | The PR only changes router metrics test logic; the modified file has no process-level stdout writes, init/TestMain, or klog/log stdout setup. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The changed test uses net.JoinHostPort/IPUrl for host formatting and only talks to cluster-internal router metrics; no new IPv4-only or external connectivity assumptions were added. |
| No-Weak-Crypto | ✅ Passed | The PR only updates a router metrics test retry condition; the edited code contains no weak-crypto, custom crypto, or secret-comparison logic. |
| Container-Privileges | ✅ Passed | The PR only updates test/extended/router/metrics.go; no container/K8s manifests or security-context fields like privileged, hostNetwork, or allowPrivilegeEscalation were added. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The only new log message is a generic retry notice; no passwords, tokens, PII, hostnames, or customer data were added to logs. |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main test change. |

Comment @coderabbitai help to get the list of available commands.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026
@openshift-ci openshift-ci Bot requested review from bentito and knobunc June 26, 2026 13:53

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
test/extended/router/metrics.go (1)

168-186: 🩺 Stability & Availability | 🔵 Trivial

Use a context-aware poll here.

This loop can run for up to 240 seconds, but wait.PollImmediate won’t stop early if the spec is canceled. If this test can take g.SpecContext, switch to wait.PollUntilContextTimeout so the retry exits with the test context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/router/metrics.go` around lines 168 - 186, The polling loop in
the metrics test uses a fixed timeout and will not exit early when the spec
context is canceled. Update the retry logic in the metrics check to use a
context-aware poll with g.SpecContext, replacing the current wait.PollImmediate
call so it respects test cancellation. Keep the existing metric validation and
retry conditions in the same callback logic, but wire them through the
context-aware polling API.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/router/metrics.go`:
- Around line 168-186: The polling loop in the metrics test uses a fixed timeout
and will not exit early when the spec context is canceled. Update the retry
logic in the metrics check to use a context-aware poll with g.SpecContext,
replacing the current wait.PollImmediate call so it respects test cancellation.
Keep the existing metric validation and retry conditions in the same callback
logic, but wire them through the context-aware polling API.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fdf94624-a83b-410c-a39c-ae02b928ab11

📥 Commits

Reviewing files that changed from the base of the PR and between 817fa8a and 465f2f7.

📒 Files selected for processing (1)
  • test/extended/router/metrics.go

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 26, 2026
@mkowalski mkowalski changed the title test/router: wait for all per-route metrics before asserting OCPBUGS-92837: test/router: wait for all per-route metrics before asserting Jun 26, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2026
@openshift-ci-robot


@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mkowalski

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2026
@openshift-ci-robot


@mkowalski: This pull request references Jira Issue OCPBUGS-92837, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from melvinjoseph86 June 26, 2026 14:12
@openshift-merge-bot

Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Copilot AI left a comment


Pull request overview

This PR improves stability of the HAProxy router metrics extended test by ensuring the retry loop doesn’t exit until required per-route metrics are available, accounting for the HAProxy exporter’s scrape interval.

Changes:

  • Adds contextual comments explaining exporter scrape-interval lag for per-route metrics.
  • Extends the wait.PollImmediate exit condition to also wait for haproxy_server_http_responses_total{code="2xx"} before proceeding to post-loop assertions.


Comment on lines +175 to +179
backendConns := findGaugesWithLabels(metrics["haproxy_backend_connections_total"], routeLabels)
if len(backendConns) > 0 && backendConns[0] >= float64(times) {
// Also verify that the HTTP response metrics have been
// populated for this route before exiting the loop.
// The exporter may not refresh all stats atomically, so

Contributor Author


Shouldn't we do this step by step and handle each metric only when it becomes necessary? With this reasoning we could say the same about every single metric, no?

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-fips | 465f2f7 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-vsphere-ovn | 465f2f7 | link | true | /test e2e-vsphere-ovn |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review


3 participants