Skip to content

OCPBUGS-86471: baremetal: fix hostSelector for default stream#10571

Open
honza wants to merge 1 commit into
openshift:mainfrom
honza:host-labels
Open

OCPBUGS-86471: baremetal: fix hostSelector for default stream#10571
honza wants to merge 1 commit into
openshift:mainfrom
honza:host-labels

Conversation

@honza
Copy link
Copy Markdown
Member

@honza honza commented May 25, 2026

Use a NotIn MatchExpression to exclude other known streams rather than requiring an exact label match via MatchLabels. NotIn matches when the label is absent or has any value not in the exclusion set, so unlabeled BareMetalHosts (e.g. spare hosts in a CI pool) can still be claimed while hosts labeled for a different stream are rejected.

This applies to all streams, not just the compile-time default, because the cluster's effective default stream is determined at install time, not by the installer binary's DefaultOSImageStream constant. The exclusion list is computed from types.OSImageStreamValues, so new streams are automatically excluded without additional code changes..

Summary by CodeRabbit

  • Bug Fixes
    • Restores bare-metal host selection so hosts lacking stream labels are matched by the default stream, avoiding unintended exclusions.
    • Ensures non-default OS image streams continue to require explicit label matching, preserving intended targeting.
    • Prevents deployments from being filtered out unless hosts are manually labeled, improving discovery and provisioning reliability.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@honza: This pull request references Jira Issue OCPBUGS-86471, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Commit edd7753 ("baremetal: Add coreos.openshift.io/stream label to BMH and hostSelector") unconditionally set HostSelector.MatchLabels on BareMetalMachineProviderSpec to require coreos.openshift.io/stream=rhel-9 on every BareMetalHost. This works for installer-created BMHs which get the label stamped during installation, but breaks scaling and upgrade scenarios where MachineSets need to claim pre-existing BareMetalHosts from the CI pool that don't carry this label. Since MatchLabels is a hard requirement (the label must exist AND match), unlabeled hosts are never selected and provisioning hangs.

Only set HostSelector when a non-default stream (e.g. rhel-10) is explicitly configured in the install-config. For the default rhel-9 case (or when osImageStream is unset), leave HostSelector empty so MachineSets can claim any available BMH regardless of labeling.

The stream labels on installer-created BMHs are left unchanged — they remain informational and correct.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 25, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key: "tools"
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Adds an import and introduces hostSelectorForStream(stream); provider now sets BareMetalMachineProviderSpec.CustomDeploy.HostSelector by calling that helper. The helper builds an exclusion list from types.OSImageStreamValues and returns a HostSelector with a single NotIn MatchExpressions requirement keyed by streamLabelKey.

Changes

HostSelector Configuration Logic

Layer / File(s) Summary
Import and stream usage
pkg/asset/machines/baremetal/machines.go
Add k8s.io/apimachinery/pkg/selection import required by the new helper and preserve stream typing in provider.
Provider wiring to helper
pkg/asset/machines/baremetal/machines.go
Replace inline CustomDeploy.HostSelector construction with hostSelectorForStream(osImageStream) in provider.
hostSelectorForStream implementation
pkg/asset/machines/baremetal/machines.go
Add hostSelectorForStream(stream) that computes an exclusion list of types.OSImageStreamValues not equal to stream and returns a HostSelector with one MatchExpressions requirement using selection.NotIn and that exclusion list (keyed by streamLabelKey).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • openshift/installer#10502: Both PRs modify the baremetal machine provider and HostSelector logic related to coreos.openshift.io/stream.

Suggested labels

approved, lgtm

Suggested reviewers

  • dtantsur
  • iurygregory
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: fixing the hostSelector logic for default streams in baremetal machines, which is the core purpose of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only production code in machines.go. No Ginkgo tests present or added; file has no test patterns or Ginkgo imports. Check not applicable.
Test Structure And Quality ✅ Passed No Ginkgo test files were added or modified in this PR. The PR only changes non-test code in machines.go.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. Changes are only to production code in pkg/asset/machines/baremetal/machines.go, which contains no test definitions.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR modifies bare metal machine provisioning code, not e2e tests. No Ginkgo test declarations found; check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR changes baremetal machine provisioning logic (HostSelector for BareMetalMachineProviderSpec), not pod scheduling constraints. No pod affinity, topology spread, PDBs, or nodeSelectors introduced.
Ote Binary Stdout Contract ✅ Passed File is production asset generation code with no process-level functions, no stdout writes, and no test binary communication code affected by changes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Changes are to baremetal machine provisioning code (pkg/asset/machines/baremetal/machines.go and machinesets.go), not test code. Check does not apply.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in the modified file.
Container-Privileges ✅ Passed PR modifies only Go provisioning code for baremetal HostSelector logic; no container security, privilege escalation, or K8s manifest configurations added.
No-Sensitive-Data-In-Logs ✅ Passed No logging of sensitive data found. The code only uses fmt for non-sensitive error messages and machine naming, with no logging frameworks or secret exposure.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from elfosardo and iurygregory May 25, 2026 19:32
@honza
Copy link
Copy Markdown
Member Author

honza commented May 25, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@honza: This pull request references Jira Issue OCPBUGS-86471, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@honza: This pull request references Jira Issue OCPBUGS-86471, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Commit edd7753 ("baremetal: Add coreos.openshift.io/stream label to BMH and hostSelector") unconditionally set HostSelector.MatchLabels on BareMetalMachineProviderSpec to require coreos.openshift.io/stream=rhel-9 on every BareMetalHost. This works for installer-created BMHs which get the label stamped during installation, but breaks scaling and upgrade scenarios where MachineSets need to claim pre-existing BareMetalHosts from the CI pool that don't carry this label. Since MatchLabels is a hard requirement (the label must exist AND match), unlabeled hosts are never selected and provisioning hangs.

Only set HostSelector when a non-default stream (e.g. rhel-10) is explicitly configured in the install-config. For the default rhel-9 case (or when osImageStream is unset), leave HostSelector empty so MachineSets can claim any available BMH regardless of labeling.

The stream labels on installer-created BMHs are left unchanged — they remain informational and correct.

Summary by CodeRabbit

Bug Fixes

  • Bug Fixes
  • Corrected bare metal host selection behavior to only apply custom filtering when explicitly configured, restoring compatibility with unlabeled bare metal hosts in deployments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@iurygregory
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2026

// Only set HostSelector when a non-default stream is explicitly configured.
// Unlabeled BMHs (e.g. spare hosts in a CI pool) won't match a selector,
// so we leave it empty for the default stream to preserve compatibility.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that it will freely pick up e.g. BMHs that are explicitly labelled for rhel-10 and happily try to install rhel-9 on them.

Ideally the semantics we want are to match any BMHs either with the label value we want or without any label of that type. That doesn't seem to be possible with the Selector API, as all of the match conditions must be true. We would have to change the API to allow multiple selectors. Or k8s needs to add a NotOut operator to selectors 🙂.

What we can do is use the NotEquals or NotIn operator to exclude the non-default value (i.e. currently rhel-10) - that does still match in the case where the label is not set. Unfortunately this is not super future-proof, as it requires us to enumerate all of the values we don't want to match.

The alternative would be to not apply the label to any hosts, and explicitly match hosts without the label. So that the default pool would always use unlabelled hosts and for the non-default pools you would have to label the hosts and add a selector to the pool. I kind of hate this because as you upgrade into a version with new default streams, the correct labelling continues to depend on the cluster's born-in version.

It looks like past-me was at least dimly aware of the problem:

The hardest part will be handling the default case that there is no label. Worst case scenario, we could run a small controller that adds the default label value to any BMHs that don't have the label. It would be better not to have to, but at least we know that a viable solution exists.

So many options, all fairly bad.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we can do is use the NotEquals or NotIn operator to exclude the non-default value (i.e. currently rhel-10) - that does still match in the case where the label is not set. Unfortunately this is not super future-proof, as it requires us to enumerate all of the values we don't want to match.

By the time we get to RHEL 11, maybe it will become less of a problem :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time we get to RHEL 11, maybe it will become less of a problem :)

I guess that's true, in the sense that by the time we drop RHEL 9 support (which will be before RHEL 11), any MachineSets that exclude RHEL 10 hosts will no longer be useful and will need to be deleted.
It's less robust against us adding other kinds of streams, but we can document what to do when adding another MachineSet with a different selector.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with the NotIn idea

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from iurygregory. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iurygregory
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2026
Comment thread pkg/asset/machines/baremetal/machines.go Outdated
Comment thread pkg/asset/machines/baremetal/machines.go Outdated
Use a NotIn MatchExpression to exclude other known streams rather than
requiring an exact label match via MatchLabels. NotIn matches when the
label is absent or has any value not in the exclusion set, so
unlabeled BareMetalHosts (e.g. spare hosts in a CI pool) can still be
claimed while hosts labeled for a different stream are rejected.

This applies to all streams, not just the compile-time default,
because the cluster's effective default stream is determined at
install time, not by the installer binary's DefaultOSImageStream
constant. The exclusion list is computed from
types.OSImageStreamValues, so new streams are automatically excluded
without additional code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

@honza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-single-node-live-iso b74347a link false /test e2e-metal-single-node-live-iso
ci/prow/e2e-metal-ovn-two-node-fencing b74347a link false /test e2e-metal-ovn-two-node-fencing

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dtantsur
Copy link
Copy Markdown
Member

dtantsur commented Jun 5, 2026

/payload-job pull-ci-openshift-cluster-baremetal-operator-main-e2e-metal-ipi-serial-ipv4

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@dtantsur: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@dtantsur
Copy link
Copy Markdown
Member

dtantsur commented Jun 5, 2026

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ipi-serial-ipv4

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@dtantsur: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@dtantsur
Copy link
Copy Markdown
Member

dtantsur commented Jun 5, 2026

/lgtm

Okay, I give up, I'm not one of the 5 people in OpenShift who understand how to use /payload-job. But the code looks like it will fix the job.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2026
@tthvo
Copy link
Copy Markdown
Member

tthvo commented Jun 5, 2026

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ipi-ovn-serial-ipv4

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ipi-ovn-serial-ipv4

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4c43a510-60fd-11f1-8ef2-87081cd6649a-0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants