Skip to content

OCPBUGS-84269: fix(nodepool): skip CNI-internal IPs in ClusterNetworkCIDRConflict check#8230

Open
amasolov wants to merge 1 commit into
openshift:mainfrom
amasolov:fix-cidr-conflict-false-positive
Open

OCPBUGS-84269: fix(nodepool): skip CNI-internal IPs in ClusterNetworkCIDRConflict check#8230
amasolov wants to merge 1 commit into
openshift:mainfrom
amasolov:fix-cidr-conflict-false-positive

Conversation

@amasolov
Copy link
Copy Markdown

@amasolov amasolov commented Apr 14, 2026

What this PR does / why we need it:

Since CAPK began exposing all VMI interface IPs for dual-stack support (kubernetes-sigs/cluster-api-provider-kubevirt#366), KubeVirt machines report OVN-Kubernetes management port addresses alongside the real infrastructure address. These OVN IPs fall within the hosted cluster's own clusterNetwork CIDR by design, causing setCIDRConflictCondition to raise a false-positive ClusterNetworkCIDRConflict condition on every KubeVirt NodePool.

This PR fixes the check to only report a collision when ALL of a machine's non-link-local addresses are inside the cluster network. When a machine also has addresses outside the cluster network, the in-network addresses are treated as expected CNI-internal IPs. Additionally:

  • Deduplicate addresses per machine (same IP reported as both InternalIP and ExternalIP)
  • Skip link-local addresses (169.254.0.0/16, fe80::/10)
  • Clear the condition when no conflicts remain

Which issue(s) this PR fixes:

Fixes #8229

Special notes for your reviewer:

The root cause is an interaction between two independent changes:

  1. PR CNV-40687: Detect machine and cluster-network cidr collision #3880 (April 2024) added the ClusterNetworkCIDRConflict condition, checking all MachineInternalIP/MachineExternalIP addresses against the clusterNetwork CIDR.
  2. CAPK PR add leases rbac for leader election of hypershift operator #366 (January 2026) changed KubevirtMachine address reporting to include all VMI interface IPs for dual-stack CSR approval support.

On KubeVirt guests running OVN-Kubernetes, the VMI now exposes the ovn-k8s-mp0 management port IP (which is by design within the pod CIDR) as a MachineInternalIP. The original check did not anticipate this because CAPK previously only reported the single primary DHCP-assigned address.

The fix uses a heuristic: if a machine has at least one non-link-local address outside the cluster network, then any in-network addresses are assumed to be CNI-internal and are not flagged. A real conflict is only reported when ALL addresses are within the cluster network.

The pre-existing build failure in secret_janitor_test.go (missing releaseinfo.NewMockProviderWithRegistryOverrides) is unrelated to this change.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Improved cluster network CIDR conflict detection: evaluate all configured cluster CIDRs for each machine, deduplicate addresses, ignore link-local/multicast/non-routable IPs, suppress in-CIDR conflict messages when a machine also has any qualifying out-of-CIDR address, and automatically clear the conflict condition when none remain.
  • Tests

    • Added unit tests covering mixed inside/outside addresses, link-local/non-routable cases, and dual-stack (IPv4/IPv6) scenarios to validate detection and suppression behavior.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 14, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@amasolov: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

  • Since CAPK began exposing all VMI interface IPs for dual-stack support (kubernetes-sigs/cluster-api-provider-kubevirt#366), KubeVirt machines report OVN-Kubernetes management port addresses alongside the real infrastructure address. These OVN IPs fall within the hosted cluster's own clusterNetwork CIDR by design, causing setCIDRConflictCondition to raise a false-positive ClusterNetworkCIDRConflict condition on every KubeVirt NodePool.
  • Fix the check to only report a collision when ALL of a machine's non-link-local addresses are inside the cluster network. When a machine also has addresses outside the cluster network, the in-network addresses are treated as expected CNI-internal IPs.
  • Deduplicate addresses per machine, skip link-local addresses, and clear the condition when no conflicts remain.

Which issue(s) this PR fixes

Fixes #8229

Test plan

  • Existing "machine cidr collision" test still passes (real conflict detected when all machine IPs are in cluster network)
  • New test: machine with addresses both inside and outside cluster network (no false positive)
  • New test: machine with link-local + CNI-internal + external addresses (no false positive)
  • Verify on KubeVirt hosted cluster with OVN-Kubernetes that ClusterNetworkCIDRConflict is no longer set

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The NodePool controller's setCIDRConflictCondition was changed to parse all cluster-network CIDR prefixes, de-duplicate per-machine IPs, and ignore link-local and multicast addresses. For each machine it collects per-address collisions but only reports a CIDR conflict if a machine has no non-link-local addresses outside the cluster-network CIDRs (i.e., all applicable addresses are within the cluster-network). When no conflicts remain the function now explicitly clears the NodePoolClusterNetworkCIDRConflictType condition. Two new tests cover mixed-address and dual-stack scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as NodePoolController
    participant NP as NodePool (status)
    participant M as Machine (addresses)
    participant CN as ClusterNetwork (CIDRs)
    Controller->>CN: Parse all ClusterNetwork CIDR prefixes
    Controller->>M: Read machine addresses
    M-->>Controller: Return deduplicated, filtered addresses (exclude link-local/multicast)
    Controller->>Controller: For each address, check membership against parsed CIDRs
    alt All non-link-local addresses are within ClusterNetwork CIDRs
        Controller->>NP: Set NodePoolClusterNetworkCIDRConflictType = True (with per-address messages)
    else Some addresses exist outside ClusterNetwork CIDRs
        Controller->>NP: Do not emit in-CIDR collision messages
    end
    alt No collision messages collected
        Controller->>NP: Remove NodePoolClusterNetworkCIDRConflictType condition
    end
Loading
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR fully addresses issue #8229's requirements: deduplicating addresses, skipping link-local addresses, treating in-network addresses as benign when machines have external addresses, and clearing the condition when resolved.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the fix for issue #8229 in the CIDR conflict condition logic and corresponding test coverage.
Stable And Deterministic Test Names ✅ Passed The pull request introduces two standard Go test functions with stable, deterministic names: TestSetMachineAndNodeConditions (expanded) and TestSetCIDRConflictConditionDualStack (new). Both are table-driven tests using t.Run() with test case names that are purely descriptive and static, such as "no cluster-api machines," "When all addresses are inside cluster networks it should report dual-stack collision," and "When machine has out-of-cluster address alongside link-local and in-cluster addresses it should not report cidr collision." None of these test names contain dynamic elements like IP addresses, generated suffixes, timestamps, UUIDs, node names, or namespace names. The test names clearly describe what scenarios are being validated without including implementation details or dynamic values.
Test Structure And Quality ✅ Passed Both test functions demonstrate high-quality test engineering with table-driven patterns, single responsibility per test case, proper isolation, meaningful assertion messages on critical checks, and adherence to Go testing idioms.
Microshift Test Compatibility ✅ Passed Tests added are standard Go unit tests using testing.T framework, not Ginkgo e2e tests. The custom check applies only to new Ginkgo e2e tests, which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The new tests are standard Go unit tests using func Test*(*testing.T) pattern, not Ginkgo e2e tests, and do not make cluster-level topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only CIDR conflict detection logic in NodePool controller with no scheduling constraints, topology assumptions, or pod placement rules introduced.
Ote Binary Stdout Contract ✅ Passed The pull request does not violate the OTE Binary Stdout Contract. The only fmt usage in the modified setCIDRConflictCondition function is fmt.Sprintf() at line 816, which performs string formatting and returns a value—it does not write to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The custom check applies to Ginkgo e2e tests with patterns like It(), Describe(), Context(), When(), etc. The tests added (TestSetMachineAndNodeConditions and TestSetCIDRConflictConditionDualStack) are standard Go unit tests using func TestXXX(t *testing.T) patterns, so the IPv6 and disconnected network compatibility check is not applicable.
Title check ✅ Passed The PR title accurately reflects the main change: fixing ClusterNetworkCIDRConflict checks by skipping CNI-internal IPs, which directly addresses the root cause of false positives from dual-stack IP reporting.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels Apr 14, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 14, 2026

Hi @amasolov. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from csrwng and enxebre April 14, 2026 05:16
@amasolov amasolov marked this pull request as draft April 14, 2026 05:16
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/nodepool_controller_test.go (1)

1288-1393: Please add a dual-stack regression case here.

These additions still run with a single ClusterNetwork entry, so they would pass even if setCIDRConflictCondition misclassified an address from ClusterNetwork[1] as "outside" and hid a real collision. A two-entry cluster-network fixture would lock down the CAPK dual-stack scenario this PR is targeting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go` around
lines 1288 - 1393, Add a regression test case that uses a dual-stack
ClusterNetwork (two entries: one IPv4 CIDR and one IPv6 CIDR) so the
table-driven tests exercise setCIDRConflictCondition against ClusterNetwork[0]
and ClusterNetwork[1]; locate the table entries near the existing cases ("no
cidr collision when ...") in nodepool_controller_test.go, add a new case whose
machinesGenerator produces addresses that fall into both the IPv4 and IPv6
cluster CIDRs (and some outside) and assert expectedCIDRCollision and
expectedAllMachine/expectedAllNodes accordingly; ensure the test setup that
builds the ClusterNetwork fixture used by the reconciler (the same variable used
by the surrounding tests) is updated to include two ClusterNetwork entries so
the new case validates dual-stack behavior and prevents misclassification
between ClusterNetwork[0] and ClusterNetwork[1].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/nodepool/conditions.go`:
- Around line 748-752: The code currently sets hasAddressOutsideClusterNetwork
by testing addr.Address only against hc.Spec.Networking.ClusterNetwork[0], which
fails on dual-stack clusters; update the logic in the function that builds
conflictingIPs and sets hasAddressOutsideClusterNetwork so that for each
non-link-local addr.Address you iterate all entries in
hc.Spec.Networking.ClusterNetwork, parse each CIDR and test Contains(addr.IP),
and only set hasAddressOutsideClusterNetwork = true when none of the
cluster-network CIDRs contain the address; apply the same multi-CIDR containment
check to the other identical block referenced around the
hasAddressOutsideClusterNetwork/conflictingIPs handling so both places use all
ClusterNetwork CIDRs rather than index 0.

---

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 1288-1393: Add a regression test case that uses a dual-stack
ClusterNetwork (two entries: one IPv4 CIDR and one IPv6 CIDR) so the
table-driven tests exercise setCIDRConflictCondition against ClusterNetwork[0]
and ClusterNetwork[1]; locate the table entries near the existing cases ("no
cidr collision when ...") in nodepool_controller_test.go, add a new case whose
machinesGenerator produces addresses that fall into both the IPv4 and IPv6
cluster CIDRs (and some outside) and assert expectedCIDRCollision and
expectedAllMachine/expectedAllNodes accordingly; ensure the test setup that
builds the ClusterNetwork fixture used by the reconciler (the same variable used
by the surrounding tests) is updated to include two ClusterNetwork entries so
the new case validates dual-stack behavior and prevents misclassification
between ClusterNetwork[0] and ClusterNetwork[1].
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 7d09322d-a8cc-4d0b-a128-5fd42c1590b8

📥 Commits

Reviewing files that changed from the base of the PR and between dd2dae7 and 3cbe858.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go

Comment thread hypershift-operator/controllers/nodepool/conditions.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/nodepool_controller_test.go (1)

1341-1393: Test case name could be more precise.

The test name says "only link-local and CNI-internal addresses are present" but the machine actually includes 192.168.1.10 which is neither link-local nor CNI-internal—it's a valid infrastructure IP outside the cluster network. The test logic is correct, but the name is misleading.

Consider renaming to something like:
"no cidr collision when machine has out-of-cluster address alongside link-local and in-cluster addresses"

♻️ Suggested name change
-		name: "no cidr collision when only link-local and CNI-internal addresses are present",
+		name: "no cidr collision when machine has out-of-cluster address alongside link-local and in-cluster addresses",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go` around
lines 1341 - 1393, The test case name string is misleading (it claims "only
link-local and CNI-internal addresses" while the Machine includes 192.168.1.10);
update the test case's name literal in the table-driven test to a precise
description such as "no cidr collision when machine has out-of-cluster address
alongside link-local and in-cluster addresses" so it matches the Machine created
by the machinesGenerator and helps future readers locate the case (look for the
test entry with the machinesGenerator creating a capiv1.Machine and
expectedCIDRCollision nil).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 1341-1393: The test case name string is misleading (it claims
"only link-local and CNI-internal addresses" while the Machine includes
192.168.1.10); update the test case's name literal in the table-driven test to a
precise description such as "no cidr collision when machine has out-of-cluster
address alongside link-local and in-cluster addresses" so it matches the Machine
created by the machinesGenerator and helps future readers locate the case (look
for the test entry with the machinesGenerator creating a capiv1.Machine and
expectedCIDRCollision nil).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3df4795b-9b85-47c0-bd04-fe164a8c1006

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbe858 and 8acfbf5.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 23, 2026
@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from d48e033 to 6066c7d Compare April 23, 2026 13:54
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 23, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/nodepool_controller_test.go (1)

2398-2490: Consider asserting the conflict message content for dual-stack cases.

The dual-stack test only checks cond.Status == ConditionTrue when a conflict is expected. A regression where only one family's CIDR is reported (e.g., the IPv6 prefix is silently dropped from the message, or the v4 CIDR is matched against a v6 address) would still pass. Since the whole point of this test is dual-stack correctness, it’s worth asserting the message mentions both the offending IP and the corresponding CIDR per case (e.g., 10.128.0.5 with 10.128.0.0/14, and fd01::5 with fd01::/48 for case 1; fd01::a with fd01::/48 for case 3).

🧪 Suggested tightening
 			if tc.expectConflict {
 				gg.Expect(cond).ToNot(BeNil(), "expected CIDR conflict condition to be set")
 				gg.Expect(cond.Status).To(Equal(corev1.ConditionTrue))
+				for _, want := range tc.expectMessageContains {
+					gg.Expect(cond.Message).To(ContainSubstring(want))
+				}
 			} else {

with an added expectMessageContains []string field populated per case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go` around
lines 2398 - 2490, The test TestSetCIDRConflictDualStack currently only asserts
that a CIDR conflict condition exists and is True; update the table-driven cases
to include an expectMessageContains []string for each case and, after calling
r.setCIDRConflictCondition and retrieving cond via
FindStatusCondition(np.Status.Conditions,
hyperv1.NodePoolClusterNetworkCIDRConflictType), when tc.expectConflict is true
also assert that cond.Message contains each substring in
tc.expectMessageContains (e.g., "10.128.0.5" and "10.128.0.0/14" and "fd01::5"
and "fd01::/48" for the first case, and "fd01::a" and "fd01::/48" for the
third); leave the no-conflict branch asserting cond is nil. This change keeps
the focus on setCIDRConflictCondition and tightens dual-stack message
verification without changing production code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 2398-2490: The test TestSetCIDRConflictDualStack currently only
asserts that a CIDR conflict condition exists and is True; update the
table-driven cases to include an expectMessageContains []string for each case
and, after calling r.setCIDRConflictCondition and retrieving cond via
FindStatusCondition(np.Status.Conditions,
hyperv1.NodePoolClusterNetworkCIDRConflictType), when tc.expectConflict is true
also assert that cond.Message contains each substring in
tc.expectMessageContains (e.g., "10.128.0.5" and "10.128.0.0/14" and "fd01::5"
and "fd01::/48" for the first case, and "fd01::a" and "fd01::/48" for the
third); leave the no-conflict branch asserting cond is nil. This change keeps
the focus on setCIDRConflictCondition and tightens dual-stack message
verification without changing production code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 375d5a36-3e24-4365-a6f2-ad6265b9cb5c

📥 Commits

Reviewing files that changed from the base of the PR and between d48e033 and 6066c7d.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
✅ Files skipped from review due to trivial changes (1)
  • hypershift-operator/controllers/nodepool/conditions.go

@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from 6066c7d to 3d65d38 Compare April 23, 2026 15:16
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
hypershift-operator/controllers/nodepool/nodepool_controller_test.go (1)

2398-2496: Dual-stack test is well-structured; consider one small addition.

The three scenarios (all-in, mixed-with-outside, IPv6-only-in-second-CIDR) cover the multi-prefix matching and the suppression rule cleanly, and the assertions on message substrings (Lines 2426, 2465) correctly validate both the IP and the matched cluster-network CIDR are reported.

Optional: add a fourth case where the machine has an address in the IPv4 cluster CIDR and another IPv4 address outside both CIDRs (dual-stack configured but machine only reports IPv4). This would directly guard against regressions of the original false-positive scenario on single-family hosts against a dual-stack ClusterNetwork.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go` around
lines 2398 - 2496, Add a fourth table-driven case inside
TestSetCIDRConflictDualStack to cover the single-family IPv4-only host against a
dual-stack ClusterNetwork: in the test vector for
NodePoolReconciler.setCIDRConflictCondition provide clusterNetwork entries for
an IPv4 CIDR (e.g., 10.128.0.0/14) and an IPv6 CIDR, and a machines entry whose
Status.Addresses contains two IPv4 addresses (one inside the IPv4 CIDR and one
outside both CIDRs); set expectConflict to false and assert no CIDR conflict
condition is set — this guards against false positives for IPv4-only machines
when the cluster is dual-stack.
hypershift-operator/controllers/nodepool/conditions.go (1)

750-847: Heuristic implementation looks correct; flag one robustness nit on dedup keys.

The rewrite correctly:

  • parses all ClusterNetwork[*].CIDR (addresses the prior multi-CIDR review),
  • deduplicates addresses per machine,
  • skips link-local unicast/multicast (covers 169.254.0.0/16 and fe80::/10),
  • suppresses in-network "conflicts" when the machine also has a non-link-local address outside every cluster CIDR (the PR's intended heuristic for OVN-K management-port IPs on KubeVirt), and
  • explicitly removes the condition when no conflicts remain.

Minor: the seen map is keyed by the raw addr.Address string (Line 781), so two semantically identical IPv6 strings (e.g., fd01::0001 and fd01::1) would bypass dedup. Since ipaddr is already parsed a couple of lines below, keying on ipaddr (or its canonical String()) after parsing would be strictly more robust. In practice CAPK typically emits a single canonical form, so this is low impact.

♻️ Optional: dedup on the parsed address
 		seen := make(map[string]struct{})
 		...
-			if _, ok := seen[addr.Address]; ok {
-				continue
-			}
-			seen[addr.Address] = struct{}{}
-
 			ipaddr, err := netip.ParseAddr(addr.Address)
 			if err != nil {
 				return err
 			}
+			key := ipaddr.String()
+			if _, ok := seen[key]; ok {
+				continue
+			}
+			seen[key] = struct{}{}
 			if ipaddr.IsLinkLocalUnicast() || ipaddr.IsLinkLocalMulticast() {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/conditions.go` around lines 750 -
847, Issue: the per-machine dedupe map in setCIDRConflictCondition currently
keys by the raw addr.Address string so different textual IPv6 forms can bypass
dedup. Fix: after parsing addr.Address into ipaddr (via netip.ParseAddr) use the
canonical form ipaddr.String() as the key in the seen map (replace uses of
seen[addr.Address] with seen[ipaddr.String()]) so deduplication is done on the
parsed/canonical IP; update the seen map creation and both the presence check
and insertion points around the parsing logic in setCIDRConflictCondition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@hypershift-operator/controllers/nodepool/conditions.go`:
- Around line 750-847: Issue: the per-machine dedupe map in
setCIDRConflictCondition currently keys by the raw addr.Address string so
different textual IPv6 forms can bypass dedup. Fix: after parsing addr.Address
into ipaddr (via netip.ParseAddr) use the canonical form ipaddr.String() as the
key in the seen map (replace uses of seen[addr.Address] with
seen[ipaddr.String()]) so deduplication is done on the parsed/canonical IP;
update the seen map creation and both the presence check and insertion points
around the parsing logic in setCIDRConflictCondition.

In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 2398-2496: Add a fourth table-driven case inside
TestSetCIDRConflictDualStack to cover the single-family IPv4-only host against a
dual-stack ClusterNetwork: in the test vector for
NodePoolReconciler.setCIDRConflictCondition provide clusterNetwork entries for
an IPv4 CIDR (e.g., 10.128.0.0/14) and an IPv6 CIDR, and a machines entry whose
Status.Addresses contains two IPv4 addresses (one inside the IPv4 CIDR and one
outside both CIDRs); set expectConflict to false and assert no CIDR conflict
condition is set — this guards against false positives for IPv4-only machines
when the cluster is dual-stack.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 71fabf99-e8f6-40af-b502-75a634ade8f0

📥 Commits

Reviewing files that changed from the base of the PR and between 6066c7d and 3d65d38.

📒 Files selected for processing (2)
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go

@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from 3d65d38 to 14079a4 Compare April 23, 2026 23:40
@amasolov amasolov marked this pull request as ready for review April 29, 2026 04:18
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026
@openshift-ci openshift-ci Bot requested review from muraee and sdminonne April 29, 2026 04:18
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2026
@amasolov
Copy link
Copy Markdown
Author

/remove-lifecycle stale

@amasolov amasolov changed the title NO-JIRA: fix(nodepool): skip CNI-internal IPs in ClusterNetworkCIDRConflict check OCPBUGS-84269: fix(nodepool): skip CNI-internal IPs in ClusterNetworkCIDRConflict check May 29, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@amasolov: This pull request references Jira Issue OCPBUGS-84269, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Since CAPK began exposing all VMI interface IPs for dual-stack support (kubernetes-sigs/cluster-api-provider-kubevirt#366), KubeVirt machines report OVN-Kubernetes management port addresses alongside the real infrastructure address. These OVN IPs fall within the hosted cluster's own clusterNetwork CIDR by design, causing setCIDRConflictCondition to raise a false-positive ClusterNetworkCIDRConflict condition on every KubeVirt NodePool.

This PR fixes the check to only report a collision when ALL of a machine's non-link-local addresses are inside the cluster network. When a machine also has addresses outside the cluster network, the in-network addresses are treated as expected CNI-internal IPs. Additionally:

  • Deduplicate addresses per machine (same IP reported as both InternalIP and ExternalIP)
  • Skip link-local addresses (169.254.0.0/16, fe80::/10)
  • Clear the condition when no conflicts remain

Which issue(s) this PR fixes:

Fixes #8229

Special notes for your reviewer:

The root cause is an interaction between two independent changes:

  1. PR CNV-40687: Detect machine and cluster-network cidr collision #3880 (April 2024) added the ClusterNetworkCIDRConflict condition, checking all MachineInternalIP/MachineExternalIP addresses against the clusterNetwork CIDR.
  2. CAPK PR add leases rbac for leader election of hypershift operator #366 (January 2026) changed KubevirtMachine address reporting to include all VMI interface IPs for dual-stack CSR approval support.

On KubeVirt guests running OVN-Kubernetes, the VMI now exposes the ovn-k8s-mp0 management port IP (which is by design within the pod CIDR) as a MachineInternalIP. The original check did not anticipate this because CAPK previously only reported the single primary DHCP-assigned address.

The fix uses a heuristic: if a machine has at least one non-link-local address outside the cluster network, then any in-network addresses are assumed to be CNI-internal and are not flagged. A real conflict is only reported when ALL addresses are within the cluster network.

The pre-existing build failure in secret_janitor_test.go (missing releaseinfo.NewMockProviderWithRegistryOverrides) is unrelated to this change.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

  • Improved cluster network CIDR conflict detection: evaluate all configured cluster CIDRs for each machine, deduplicate addresses, ignore link-local/multicast/non-routable IPs, suppress in-CIDR conflict messages when a machine also has any qualifying out-of-CIDR address, and automatically clear the conflict condition when none remain.

  • Tests

  • Added unit tests covering mixed inside/outside addresses, link-local/non-routable cases, and dual-stack (IPv4/IPv6) scenarios to validate detection and suppression behavior.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2026
@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from 14079a4 to cdd2396 Compare May 29, 2026 12:53
@amasolov
Copy link
Copy Markdown
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@amasolov: This pull request references Jira Issue OCPBUGS-84269, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Copy link
Copy Markdown
Member

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 92.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.72%. Comparing base (1146767) to head (0c35447).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
...rshift-operator/controllers/nodepool/conditions.go 92.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8230      +/-   ##
==========================================
+ Coverage   40.69%   40.72%   +0.02%     
==========================================
  Files         755      755              
  Lines       93417    93450      +33     
==========================================
+ Hits        38019    38055      +36     
+ Misses      52659    52657       -2     
+ Partials     2739     2738       -1     
Files with missing lines Coverage Δ
...rshift-operator/controllers/nodepool/conditions.go 56.08% <92.50%> (+2.14%) ⬆️
Flag Coverage Δ
cmd-support 34.70% <ø> (ø)
cpo-hostedcontrolplane 41.81% <ø> (ø)
cpo-other 41.43% <ø> (ø)
hypershift-operator 50.92% <92.50%> (+0.07%) ⬆️
other 31.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@bryan-cox bryan-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: CIDR Conflict False-Positive Fix

Overall this is a well-reasoned fix with a sound heuristic. The production code changes are correct and the dual-stack support is a good improvement over the original single-CIDR assumption. I have a few findings below — two blocking items around missing behavioral test coverage, plus several suggestions and nits.

Blocking (2):

  • Missing test for condition clearing (the new else removeStatusCondition path)
  • Missing test for link-local + in-cluster only addresses (no outside address)

Suggestions (3):

  • Add debug logging for suppressed CNI-internal addresses
  • Add t.Parallel() to TestSetCIDRConflictConditionDualStack and a multi-machine test case
  • Add a comment explaining why two test functions cover overlapping behavior

Nits (1):

  • Move conflictEntry type declaration outside the loop

Questions (1):

  • What is the expected behavior when a machine has only link-local addresses?

The test quality is good — proper Gherkin names, table-driven structure, gomega matchers, and good dual-stack coverage. Closing the two behavioral gaps above will make this solid.

Comment on lines +770 to +773
type conflictEntry struct {
ip string
cidr string
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] The conflictEntry type is declared inside the per-machine loop body, meaning Go re-declares the type on every iteration. While the compiler handles this fine, it's unusual and makes the type harder to find. Consider moving it above the loop (or to package level) for readability:

type conflictEntry struct {
	ip   string
	cidr string
}

// ...
for _, machine := range machines {
	var conflicting []conflictEntry
	// ...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved to cidrConflictEntry at function scope above the loop.

Comment on lines +792 to +794
if ipaddr.IsLinkLocalUnicast() || ipaddr.IsLinkLocalMulticast() {
continue
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] What happens when a machine has zero qualifying addresses — i.e., all addresses are link-local? In that case hasAddressOutsideClusterNetwork stays false and conflicting is empty, so the !hasAddressOutsideClusterNetwork branch enters but appends nothing. The behavior is correct (no conflict message), but is a machine with only link-local addresses a valid state? If not, should we flag it differently (e.g., a warning)? If it is valid, please add a test case to document that expectation (see my test comment below).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a machine has only link-local addresses, both hasAddressOutsideClusterNetwork and conflicting remain empty, so no conflict is reported and no condition is set. This is correct because link-local addresses are not routable infrastructure IPs and should not influence the heuristic in either direction.

Added a test documenting this expectation: "When machine has only link-local addresses it should not report cidr collision"

Comment on lines +816 to 821
if !hasAddressOutsideClusterNetwork {
for _, entry := range conflicting {
messages = append(messages, fmt.Sprintf("machine [%s] with ip [%s] collides with cluster-network cidr [%s]", machine.Name, entry.ip, entry.cidr))
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] Consider emitting a structured log line at debug level when addresses are suppressed by the heuristic. Operators debugging networking issues on KubeVirt clusters might want to see that addresses were evaluated but deemed CNI-internal:

if hasAddressOutsideClusterNetwork && len(conflicting) > 0 {
	log.Log.V(4).Info("Skipping CNI-internal addresses for CIDR conflict check",
		"machine", machine.Name,
		"suppressedCount", len(conflicting))
}

This is optional but would help day-2 troubleshooting.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a log.V(4).Info call when addresses are suppressed:

} else if len(conflicting) > 0 {
    log.V(4).Info("Skipping CNI-internal addresses for CIDR conflict check",
        "machine", machine.Name,
        "suppressedCount", len(conflicting))
}

Also added ctx to the function signature to get a structured logger via ctrl.LoggerFrom(ctx).

Comment on lines +844 to +845
} else {
removeStatusCondition(&nodePool.Status.Conditions, hyperv1.NodePoolClusterNetworkCIDRConflictType)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocking] This else branch with removeStatusCondition is new behavior — previously the condition was never cleared once set. This is a great operational improvement, but no test verifies it.

Please add a test case that:

  1. Pre-populates a NodePool with an existing ClusterNetworkCIDRConflictType condition (Status=True)
  2. Calls setCIDRConflictCondition with machines that have addresses outside the cluster network (so no conflict should be reported)
  3. Asserts the condition is removed from nodePool.Status.Conditions

Example name: "When previously-conflicting machine gains out-of-cluster address it should clear the cidr conflict condition"

Without this test, a regression that breaks condition clearing would go undetected.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added test case: "When previously-conflicting machine gains out-of-cluster address it should clear the cidr conflict condition"

It pre-populates the NodePool with an existing ClusterNetworkCIDRConflictType=True condition, then calls setCIDRConflictCondition with a machine that has an outside address, and asserts the condition is removed.

Comment on lines +2073 to 2131
name: "When machine has out-of-cluster address alongside link-local and in-cluster addresses it should not report cidr collision",
machinesGenerator: func() []client.Object {
return []client.Object{
&capiv1.Machine{
ObjectMeta: metav1.ObjectMeta{
Name: "node1",
Namespace: "myns-cluster-name",
Annotations: map[string]string{
nodePoolAnnotation: "myns/np-name",
},
},
Status: capiv1.MachineStatus{
Conditions: []capiv1.Condition{
{
Type: capiv1.ReadyCondition,
Status: corev1.ConditionTrue,
},
{
Type: capiv1.MachineNodeHealthyCondition,
Status: corev1.ConditionTrue,
},
},
Addresses: capiv1.MachineAddresses{
{
Type: capiv1.MachineInternalIP,
Address: "192.168.1.10",
},
{
Type: capiv1.MachineInternalIP,
Address: "169.254.0.2",
},
{
Type: capiv1.MachineInternalIP,
Address: "fe80::1",
},
{
Type: capiv1.MachineInternalIP,
Address: "10.10.10.2",
},
},
},
},
}
},
expectedAllMachine: &testCondition{
Status: corev1.ConditionTrue,
Reason: hyperv1.AsExpectedReason,
Messages: []string{hyperv1.AllIsWellMessage},
},
expectedAllNodes: &testCondition{
Status: corev1.ConditionTrue,
Reason: hyperv1.AsExpectedReason,
Messages: []string{hyperv1.AllIsWellMessage},
},
expectedCIDRCollision: nil,
},
} {
t.Run(tc.name, func(tt *testing.T) {
gg := NewWithT(tt)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocking] This test case verifies two behaviors simultaneously — link-local address skipping AND out-of-cluster address suppression — because the machine has 192.168.1.10 (outside), 169.254.0.2 (link-local v4), fe80::1 (link-local v6), and 10.10.10.2 (in-cluster). Since 192.168.1.10 is outside, the heuristic suppresses the conflict regardless of whether link-local filtering works.

There is no test that verifies link-local addresses are excluded from the "outside" determination. Please add a case where a machine has only link-local + in-cluster addresses (no outside address):

{
	name: "When machine has only link-local and in-cluster addresses it should report cidr collision",
	// Addresses: 169.254.0.2 (link-local), fe80::1 (link-local), 10.10.10.2 (in-cluster)
	// Expected: conflict IS reported because link-local doesn't count as "outside"
}

This tests the behavioral contract that link-local addresses are truly excluded from the heuristic, not just skipped during iteration. Without it, a bug that counts link-local as "outside" would silently suppress real conflicts.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a dedicated test case: "When machine has only link-local and in-cluster addresses it should report cidr collision"

This machine has only 169.254.0.2 (link-local v4), fe80::1 (link-local v6), and 10.128.0.5 (in-cluster), with no outside address. Asserts that the conflict IS reported, verifying that link-local addresses are truly excluded from the "outside" determination and do not silently suppress real conflicts.

Comment on lines +2416 to +2417
func TestSetCIDRConflictConditionDualStack(t *testing.T) {
r := NodePoolReconciler{}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] A few improvements for this test function:

  1. Add t.Parallel(): This function creates fresh NodePool, HostedCluster, and NodePoolReconciler{} per subtest with no shared mutable state. It should be parallelized, consistent with TestSetMachineAndNodeConditions which calls t.Parallel().

  2. Add a multi-machine test case: All test cases use a single machine. Please add a case with 2+ machines where one has a real conflict (all addresses in-cluster) and another is suppressed (has an outside address). This verifies the per-machine logic doesn't leak across machines.

  3. [nit] This function tests setCIDRConflictCondition directly while the two cases added to TestSetMachineAndNodeConditions above test it indirectly via the parent setMachineAndNodeConditions. The split is reasonable (isolated unit test vs integration), but consider adding a brief comment at the top explaining why both exist — future maintainers may wonder whether they're redundant.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three done:

  1. Added t.Parallel() at the function level and tt.Parallel() in each subtest.
  2. Added multi-machine test case: "When multiple machines have mixed conflict states it should only report the conflicting one" (one machine with all addresses in-cluster, another with an outside address, verifies per-machine isolation).
  3. Added a comment at the top of TestSetCIDRConflictConditionDualStack explaining why both this direct unit test and the integration-level cases in TestSetMachineAndNodeConditions exist.

@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from cdd2396 to a28c276 Compare May 29, 2026 23:59
@amasolov
Copy link
Copy Markdown
Author

amasolov commented Jun 2, 2026

@bryan-cox does it look better now?

@bryan-cox
Copy link
Copy Markdown
Member

/ok-to-test

@bryan-cox
Copy link
Copy Markdown
Member

/approve

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amasolov, bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2026
@hypershift-jira-solve-ci
Copy link
Copy Markdown

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

codecov/project: 40.70% (-5.14%) compared to 9b67f7b

Summary

The codecov/project check failed because project-wide coverage dropped from 45.84% (base commit 9b67f7b) to 40.70% on the PR, a -5.14% decrease. This is not caused by the PR's code changes — the PR has excellent 92.50% patch coverage (37/40 lines covered) and codecov/patch passed. The drop is a phantom caused by a known Codecov shard-upload timing issue in the hypershift repository: unit tests run in 5 parallel shards, each uploading separate coverage data, and without carryforward flags the project total swings wildly depending on upload order. The base report saw 440 files / 52,824 lines while the PR report saw 755 files / 93,401 lines — a difference of 315 files and 40,577 lines that is impossible from a 2-file change. PR #8642, merged earlier today, adds the missing carryforward: true flags to codecov.yml to fix exactly this problem, but PR #8230's branch predates that fix.

Root Cause

The root cause is a missing flags: section with carryforward: true in PR #8230's version of codecov.yml.

The hypershift CI runs unit tests in 5 parallel shards (cpo-hostedcontrolplane, cpo-other, hypershift-operator, cmd-support, other), each uploading a separate coverage report to Codecov. Without carryforward: true flag definitions, Codecov recomputes the project total after each shard upload and posts intermediate statuses. When shards upload in different orders between the base commit and the PR commit, coverage totals fluctuate by as much as 5%+ — as documented in PR #8642 where the same commit adfbcddc2e on main swung from 41.80% → 47.18% → 45.84% within 10 minutes.

The Codecov report itself confirms the anomaly: it is "36 commits behind head on main" and shows 315 new files (+40,577 lines) between the base and PR — this is impossible for a 2-file change. The base commit 9b67f7b had a partial/favorable shard upload order (440 files, 45.84%), while the PR commit a28c276 got a different combination (755 files, 40.70%), creating the phantom -5.14% drop.

PR #8642 ("Add codecov carryforward flags to stabilize project coverage checks") was merged to main at 11:27 UTC today to fix this exact issue. However, PR #8230's branch (commit a28c276 from April 14) does not contain this fix and still uses the old codecov.yml without the flags: section.

Recommendations
  1. Rebase PR OCPBUGS-84269: fix(nodepool): skip CNI-internal IPs in ClusterNetworkCIDRConflict check #8230 onto main to pick up the codecov.yml carryforward fix from PR CNTRLPLANE-3535: Add codecov carryforward flags to stabilize project coverage checks #8642. This will resolve the phantom project coverage drop.

    git fetch origin main
    git rebase origin/main
    git push --force-with-lease
  2. No code changes needed — the PR's actual patch coverage is 92.50%, which is excellent. The 3 uncovered lines (2 missing + 1 partial in conditions.go) are minor and do not indicate a test gap.

  3. Re-run the codecov checks after rebasing. With carryforward flags in place, the project coverage should stabilize and the check should pass.

Evidence
Evidence Detail
codecov/project result failure — 40.70% (-5.14%) vs base 45.84%
codecov/patch result success — 92.50% patch coverage (37/40 lines)
Files changed by PR 2 files (conditions.go, nodepool_controller_test.go)
Files diff in coverage report +315 files, +40,577 lines (impossible from 2-file change)
Codecov report staleness "36 commits behind head on main"
Base coverage report 440 files, 52,824 lines, 45.84% coverage
PR coverage report 755 files, 93,401 lines, 40.70% coverage
PR #8230 codecov.yml No flags: section — missing carryforward configuration
Main branch codecov.yml Has flags: section with 5 carryforward flags (added by PR #8642)
PR #8642 "Add codecov carryforward flags to stabilize project coverage checks" — merged 2026-06-02T11:27:27Z
PR #8230 branch created 2026-04-14 (predates the carryforward fix)
Known shard instability Same commit adfbcddc2e on main: 41.80% → 47.18% → 45.84% within 10 min (per PR #8642)

Since CAPK began exposing all VMI interface IPs for dual-stack support
(cluster-api-provider-kubevirt#366), KubeVirt machines report
OVN-Kubernetes management port addresses (e.g. ovn-k8s-mp0) alongside
the real infrastructure address. These OVN IPs fall within the hosted
cluster's own clusterNetwork CIDR by design, causing
setCIDRConflictCondition to raise a false-positive
ClusterNetworkCIDRConflict condition on every KubeVirt NodePool.

Fix the check so that a collision is only reported when ALL of a
machine's non-link-local addresses are inside the cluster network,
which indicates the machine's infrastructure IP genuinely overlaps
with the pod CIDR. When a machine also has addresses outside the
cluster network, the in-network addresses are treated as expected
CNI-internal IPs and skipped.

Additionally:
- Parse and check all ClusterNetwork CIDRs for dual-stack support
- Deduplicate addresses per machine using canonical IP form
- Skip link-local addresses (169.254.0.0/16, fe80::/10)
- Clear the condition when no conflicts remain
- Log suppressed CNI-internal addresses at debug level

Signed-off-by: Alexey Masolov <amasolov@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@amasolov amasolov force-pushed the fix-cidr-conflict-false-positive branch from a28c276 to 0c35447 Compare June 2, 2026 13:25
@bshirren
Copy link
Copy Markdown

bshirren commented Jun 2, 2026

/ok-to-test

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

@amasolov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@amasolov
Copy link
Copy Markdown
Author

amasolov commented Jun 3, 2026

@bryan-cox @csrwng @enxebre @muraee @sdminonne review please to avoid going in circles again

@bryan-cox
Copy link
Copy Markdown
Member

@bryan-cox @csrwng @enxebre @muraee @sdminonne review please to avoid going in circles again

@amasolov I put the approve on so I'll leave it up to someone else to put the final lgtm on.

You'll also need to have this PR pre-merge tested for the verified label. Do you have someone on your team who could QE this PR?

@amasolov
Copy link
Copy Markdown
Author

amasolov commented Jun 3, 2026

@bryan-cox thank you for all your help here!

I'm a TAM, so I don't have a QE team behind me for this sort of thing. I raised and authored this PR purely to help my customer work through a real issue they're hitting in production with KubeVirt NodePools. The unit tests pass and all CI is green, but I'm not set up to run the pre-merge verification workflow myself.

I understand there are ongoing QE capacity challenges across many of our products, so I don't want to make assumptions about who can pick this up. What would be the best path forward here? Is there a QE team or individual who typically handles verification for HyperShift PRs that I could reach out to, or is there another process I should follow to get the verified label?

Happy to provide any additional context or help coordinate, just don't want this to go stale again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ClusterNetworkCIDRConflict false positive on KubeVirt NodePools after CAPK dual-stack address change

4 participants