[DNM] Initial attempt at using wait step to back out of failed servicing. #452

Open
jacob-anders wants to merge 4 commits into openshift:main from jacob-anders:bmo-service-abort-initial

Conversation

@jacob-anders

Generated-By: Claude Code Sonnet 4

(cherry picked from commit c40d0ad)

@jacob-anders
Author

PR meant for cluster-bot to allow rapid testing of an important workaround

@jacob-anders
Author

/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jan 28, 2026
openshift-ci bot requested review from elfosardo and honza on January 28, 2026 09:33
@openshift-ci

openshift-ci bot commented Jan 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacob-anders
Once this PR has been reviewed and has the lgtm label, please assign honza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


// Track if HFS spec has actual settings - check independently since getHostFirmwareSettings
// returns nil when no changes even if object exists
hfsExists := &metal3api.HostFirmwareSettings{}
Member

You probably already do this Get inside getHostFirmwareSettings; let's try to avoid another request.

Author

Actually, I feel we discussed this already; it's probably a matter of digging out an old unmerged patch. Will sort it out.
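
One way the duplicate-request concern could be addressed, sketched under assumptions: a single lookup returns both the fetched object and whether its spec carries any settings, so the caller never needs a second Get. Every name below (FirmwareSettings, getter, lookupFirmwareSettings) is a hypothetical stand-in, not the actual metal3api types or the controller code in this PR.

```go
package main

import "fmt"

// FirmwareSettings is a hypothetical stand-in for metal3api.HostFirmwareSettings.
type FirmwareSettings struct {
	Settings map[string]string
}

// getter is a hypothetical stand-in for the controller's client. It counts
// calls so the example can show that only one request is made per lookup.
type getter struct {
	objects map[string]*FirmwareSettings
	calls   int
}

// Get returns the stored object and whether it exists.
func (g *getter) Get(name string) (*FirmwareSettings, bool) {
	g.calls++
	obj, ok := g.objects[name]
	return obj, ok
}

// firmwareSettingsState carries both facts learned from one lookup.
type firmwareSettingsState struct {
	Object  *FirmwareSettings // nil when the object does not exist
	HasSpec bool              // true when the spec holds at least one setting
}

// lookupFirmwareSettings performs exactly one Get and derives both
// "does the object exist" and "does it request any settings" from it.
func lookupFirmwareSettings(g *getter, name string) firmwareSettingsState {
	obj, ok := g.Get(name)
	if !ok {
		return firmwareSettingsState{}
	}
	return firmwareSettingsState{Object: obj, HasSpec: len(obj.Settings) > 0}
}

func main() {
	g := &getter{objects: map[string]*FirmwareSettings{
		"host-0": {Settings: map[string]string{}}, // exists, but requests nothing
	}}
	state := lookupFirmwareSettings(g, "host-0")
	fmt.Println(state.Object != nil, state.HasSpec, g.calls) // true false 1
}
```

The same shape would presumably apply to the HostFirmwareComponents lookup further down.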

// Track if HFC spec has actual updates - check independently since getHostFirmwareComponents
// returns nil when no changes even if object exists
hfcExists := &metal3api.HostFirmwareComponents{}
hfcExistsErr := r.Get(info.ctx, info.request.NamespacedName, hfcExists)
Member

same

// let the provisioner handle the transition back to active
if info.host.Status.ErrorType == metal3api.ServicingError && !hasChanges {
	info.log.Info("updates removed from spec while in servicing error state, attempting recovery")
	provResult, _, err := prov.Service(servicingData, false, false)
Member

Thinking aloud: as we grow capabilities to abort processes, maybe we need a separate Abort call in the provisioner. Then you won't need to pass all these Has*Spec variables, and the Ironic-level code will be cleaner (if not necessarily shorter).

Author

I will look into this.
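
A rough sketch of what a dedicated abort entry point might look like, assuming a simplified provisioner. The interface name ServiceAborter, the method AbortServicing, the error ErrNotAbortable, and the fake implementation are all hypothetical; nothing here reflects an existing provisioner API. The point is only that the caller no longer threads Has*Spec flags through Service.

```go
package main

import (
	"errors"
	"fmt"
)

// ProvisionState is a hypothetical stand-in for the Ironic node state.
type ProvisionState string

const (
	ServiceWait ProvisionState = "service wait"
	Active      ProvisionState = "active"
)

// ErrNotAbortable is a hypothetical error for nodes that are not waiting.
var ErrNotAbortable = errors.New("node is not waiting in a service step")

// ServiceAborter is a hypothetical, dedicated abort entry point that would
// sit next to Service in the provisioner interface.
type ServiceAborter interface {
	AbortServicing() (started bool, err error)
}

// fakeProvisioner records abort requests instead of talking to Ironic.
type fakeProvisioner struct {
	state         ProvisionState
	abortRequests int
}

// AbortServicing accepts the request only while the node is in a wait state.
func (p *fakeProvisioner) AbortServicing() (bool, error) {
	if p.state != ServiceWait {
		return false, ErrNotAbortable
	}
	p.abortRequests++
	return true, nil
}

// recoverFromServicingError is a hypothetical controller-side caller: it asks
// for an abort only when the user has removed all requested changes.
func recoverFromServicingError(prov ServiceAborter, hasChanges bool) (bool, error) {
	if hasChanges {
		return false, nil // updates are still requested, nothing to abort
	}
	return prov.AbortServicing()
}

func main() {
	prov := &fakeProvisioner{state: ServiceWait}
	started, err := recoverFromServicingError(prov, false)
	fmt.Println(started, err, prov.abortRequests) // true <nil> 1
}
```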

ironicNode,
nodes.ProvisionStateOpts{Target: nodes.TargetAbort},
)
p.log.Info("janders_debug: abort result", "started", started, "result", result, "error", err)
Member

To be cleaned up

	result, err = operationComplete()
case nodes.Servicing, nodes.ServiceWait:
	// If user actually removed spec.updates/spec.settings while servicing is in progress, abort immediately
	if !data.HasFirmwareSettingsSpec && !data.HasFirmwareComponentsSpec {
Member

This cannot be done in Servicing, only in ServiceWait

Author

Yeah, fair point, will drop nodes.Servicing.
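
The agreed change could be sketched roughly as follows: the abort decision is taken only in the wait state, and an in-progress servicing step is left alone. ProvisionState and its constants are local stand-ins for the gophercloud node states, and shouldAbortServicing is a hypothetical helper rather than code from this PR.

```go
package main

import "fmt"

// ProvisionState is a local stand-in for the node provision state type.
type ProvisionState string

const (
	Servicing   ProvisionState = "servicing"
	ServiceWait ProvisionState = "service wait"
)

// shouldAbortServicing reports whether an abort should be requested.
// It returns true only when the user has removed both the firmware settings
// and the firmware components from the spec AND the node is in ServiceWait.
func shouldAbortServicing(state ProvisionState, hasSettingsSpec, hasComponentsSpec bool) bool {
	if hasSettingsSpec || hasComponentsSpec {
		return false // something is still requested, keep servicing
	}
	switch state {
	case ServiceWait:
		return true
	case Servicing:
		// A step is actively running; wait for the node to reach a wait state.
		return false
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldAbortServicing(Servicing, false, false))   // false
	fmt.Println(shouldAbortServicing(ServiceWait, false, false)) // true
	fmt.Println(shouldAbortServicing(ServiceWait, true, false))  // false
}
```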

openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Feb 18, 2026
jacob-anders force-pushed the bmo-service-abort-initial branch from ab976fa to 05396cd on February 18, 2026 04:16
openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Feb 18, 2026
@jacob-anders
Author

/test e2e-metal-ipi-ovn-ipv6

@openshift-ci

openshift-ci bot commented Feb 19, 2026

@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 67a5369 | link | true | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

3 participants