OCPBUGS-86826: Make vsphere template updates atomic #6117
@djoshy: This pull request references Jira Issue OCPBUGS-86826, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker.
|
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough
Adds deterministic temp/rollback VM naming, pre-import cleanup of stale temp/rollback VMs, importing OVFs to a deterministic temp VM, atomic rename-based template replacement (rename old -> rollback, temp -> prod, destroy rollback), and reconciliation recovery that renames rollback back to production when needed.
Changes
Template Replacement Atomicity
sequenceDiagram
participant Reconciler
participant vSphere
participant ImportedVM
participant ProdTemplate
Reconciler->>vSphere: compute tempName and rollbackName
Reconciler->>vSphere: destroyVMIfPresent tempName and rollbackName
Reconciler->>vSphere: import OVF to ImportedVM (tempName)
vSphere->>vSphere: check ProdTemplate exists
alt ProdTemplate exists
vSphere->>ProdTemplate: rename Prod -> rollback
vSphere->>ImportedVM: rename temp -> Prod
vSphere->>ProdTemplate: destroy rollback
else
vSphere->>ImportedVM: rename temp -> Prod
end
Reconciler->>vSphere: on missing ProdTemplate rename rollback -> Prod
Reconciler->>vSphere: fetch refreshed ProdTemplate reference
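The rename-based replacement in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `inventory` interface, the in-memory `fakeInventory`, the `recoverTemplate` helper, and the VM names are hypothetical stand-ins for the real vSphere calls, and only the ordering of the steps is taken from the walkthrough.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "no VM with that name".
var errNotFound = errors.New("vm not found")

// inventory is a hypothetical, minimal stand-in for the vSphere operations
// that the swap needs; it is not the govmomi API.
type inventory interface {
	Exists(name string) bool
	Rename(from, to string) error
	Destroy(name string) error
}

// fakeInventory is an in-memory inventory mapping VM name to a content label.
type fakeInventory map[string]string

func (f fakeInventory) Exists(name string) bool { _, ok := f[name]; return ok }

func (f fakeInventory) Rename(from, to string) error {
	v, ok := f[from]
	if !ok {
		return fmt.Errorf("rename %q: %w", from, errNotFound)
	}
	if _, taken := f[to]; taken {
		return fmt.Errorf("rename %q: target %q already exists", from, to)
	}
	delete(f, from)
	f[to] = v
	return nil
}

func (f fakeInventory) Destroy(name string) error {
	if _, ok := f[name]; !ok {
		return fmt.Errorf("destroy %q: %w", name, errNotFound)
	}
	delete(f, name)
	return nil
}

// swapTemplate promotes the imported temp VM to the production name.
// The old template is only moved aside, so it survives until the new one
// is in place.
func swapTemplate(inv inventory, prod, temp, rollback string) error {
	hadProd := inv.Exists(prod)
	if hadProd {
		if err := inv.Rename(prod, rollback); err != nil {
			return fmt.Errorf("moving %q aside: %w", prod, err)
		}
	}
	if err := inv.Rename(temp, prod); err != nil {
		return fmt.Errorf("promoting %q: %w", temp, err)
	}
	if hadProd {
		if err := inv.Destroy(rollback); err != nil {
			return fmt.Errorf("destroying %q: %w", rollback, err)
		}
	}
	return nil
}

// recoverTemplate restores the rollback copy when an interrupted swap left
// no VM under the production name.
func recoverTemplate(inv inventory, prod, rollback string) error {
	if inv.Exists(prod) || !inv.Exists(rollback) {
		return nil
	}
	return inv.Rename(rollback, prod)
}

func main() {
	inv := fakeInventory{"rhcos": "old image", "mco-tmp-rhcos": "new image"}
	if err := swapTemplate(inv, "rhcos", "mco-tmp-rhcos", "mco-old-rhcos"); err != nil {
		fmt.Println("swap failed:", err)
		return
	}
	fmt.Println(inv["rhcos"], len(inv)) // prints "new image 1"
}
```

In this sketch a failure between the two renames leaves the old template under the rollback name, which is the situation `recoverTemplate` handles on the next reconcile.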
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Important
Pre-merge checks failed
Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (14 passed)
✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning
There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.12.2)
Command failed
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: djoshy
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@djoshy: This pull request references Jira Issue OCPBUGS-86826, which is valid. 3 validation(s) were run on this bug
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/81361080-5e98-11f1-8281-d019b8daff8d-0 |
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/bootimage/vsphere_helpers.go (1)
321-333: 💤 Low value
Consider also cleaning up stale `oldTempName` VMs.
Currently, only `tempName` (`mco-tmp-*`) is cleaned up at the start. If a previous `swapTemplate` completed both renames but the final `Destroy` failed, the old template under `oldTempName` (`mco-old-*`) would remain orphaned. Adding a cleanup call for `oldTempName` here would prevent accumulation of orphaned VMs from repeated destroy failures.
♻️ Suggested improvement
     if err := cleanupStaleTempVM(ctx, finder, tempName); err != nil {
         return err
     }
    +if err := cleanupStaleTempVM(ctx, finder, oldTempName); err != nil {
    +    return err
    +}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/bootimage/vsphere_helpers.go` around lines 321 - 333, The current startup cleanup only calls cleanupStaleTempVM for tempName which leaves orphaned oldTempName VMs if a prior swap left mco-old-* behind; add a call to cleanupStaleTempVM(ctx, finder, oldTempName) alongside the existing cleanup for tempName (handle its error return the same way) before proceeding with swapTemplate/findAllRequiredResources so both tempName and oldTempName are cleaned up; reference tempName, oldTempName, cleanupStaleTempVM and ensure error is returned if cleanupStaleTempVM(ctx, finder, oldTempName) fails.
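As a rough illustration of the cleanup this nitpick asks for, the sketch below destroys both leftover names and treats a missing VM as success, so repeated reconcile passes stay idempotent. The `vmStore` type, the `cleanupStale` helper, and the VM names are hypothetical and are not the signature of `cleanupStaleTempVM` in the PR. A real implementation would also need to avoid destroying the rollback copy while no production template exists, since that copy is what recovery restores.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "no VM with that name".
var errNotFound = errors.New("vm not found")

// vmStore is a hypothetical in-memory stand-in for the vSphere inventory.
type vmStore map[string]bool

// destroy removes a VM and reports errNotFound when it is absent.
func (s vmStore) destroy(name string) error {
	if !s[name] {
		return fmt.Errorf("destroy %q: %w", name, errNotFound)
	}
	delete(s, name)
	return nil
}

// cleanupStale destroys each named leftover VM. A missing VM is not an
// error, so calling this on every pass is safe.
func cleanupStale(s vmStore, names ...string) error {
	for _, name := range names {
		if err := s.destroy(name); err != nil && !errors.Is(err, errNotFound) {
			return fmt.Errorf("cleaning up stale VM %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	// Only the rollback leftover exists; the temp name was never created.
	s := vmStore{"rhcos": true, "mco-old-rhcos": true}
	err := cleanupStale(s, "mco-tmp-rhcos", "mco-old-rhcos")
	fmt.Println(err, len(s)) // prints "<nil> 1"
}
```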
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 556-559: The error returned when finder.VirtualMachine(ctx,
oldTempName) fails currently wraps only the original err and omits oldErr;
update the error construction in the rollback VM lookup (around
finder.VirtualMachine / oldVM / oldErr) to include oldErr details (e.g., add
oldErr to the fmt.Errorf message or wrap it with %w) so the returned error
contains both the template not found context and the actual failure reason from
oldErr when looking up oldTempName alongside the existing err and variables
name/oldTempName.
---
Nitpick comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 321-333: The current startup cleanup only calls cleanupStaleTempVM
for tempName which leaves orphaned oldTempName VMs if a prior swap left
mco-old-* behind; add a call to cleanupStaleTempVM(ctx, finder, oldTempName)
alongside the existing cleanup for tempName (handle its error return the same
way) before proceeding with swapTemplate/findAllRequiredResources so both
tempName and oldTempName are cleaned up; reference tempName, oldTempName,
cleanupStaleTempVM and ensure error is returned if cleanupStaleTempVM(ctx,
finder, oldTempName) fails.
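The inline comment about the rollback lookup could be addressed along these lines. This is a sketch under assumptions: `rollbackLookupError`, the sample errors, and the message wording are hypothetical, and only the idea of reporting both `err` and `oldErr` comes from the review. It wraps `oldErr` with `%w` and formats the original `err` with `%v`, which works on any Go version that supports `%w`.

```go
package main

import (
	"errors"
	"fmt"
)

// Sample errors used only for this illustration.
var (
	errTemplateMissing = errors.New("no such template")
	errConnReset       = errors.New("connection reset")
)

// rollbackLookupError reports both the original template lookup failure and
// the failure hit while looking up the rollback VM. Only oldErr is wrapped,
// so errors.Is and errors.As see the rollback lookup failure.
func rollbackLookupError(name, oldTempName string, err, oldErr error) error {
	return fmt.Errorf("template %q not found (%v); looking up rollback VM %q also failed: %w",
		name, err, oldTempName, oldErr)
}

func main() {
	combined := rollbackLookupError("rhcos", "mco-old-rhcos", errTemplateMissing, errConnReset)
	fmt.Println(combined)
	fmt.Println(errors.Is(combined, errConnReset)) // prints "true"
}
```

If callers also need to match the original error with `errors.Is`, Go 1.20 and later accept more than one `%w` verb in a single `fmt.Errorf` call.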
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 518a9907-6871-46fa-85ac-cc35b0a64636
📒 Files selected for processing (1)
pkg/controller/bootimage/vsphere_helpers.go
|
/payload-abort |
80b5b21 to 276fe58
|
/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bb9eb130-5e9a-11f1-9e66-32f081cd610f-0 |
276fe58 to fab78e4
- What I did
- How to verify it
This is fairly hard to reproduce as it only happens due to network failures. At the very least, we should ensure that vsphere bootimage tests continue to work as expected.
Summary by CodeRabbit
New Features
Bug Fixes