qvm-template-upgrade: add orchestration workflow and in-VM agent #213
nihalxkumar wants to merge 3 commits into
ben-grande
left a comment
Yay, started. As I've done a review now, I will receive a notification every time you commit. Let me know when you need another look or have doubts by mentioning me.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   71.72%   69.57%   -2.16%
==========================================
  Files          12       28      +16
  Lines        1337     2485    +1148
==========================================
+ Hits          959     1729     +770
- Misses        378      756     +378
|
|
PipelineRetryFailed |
|
Some tests are not enabled on this repo, such as mypy, black and pylint. See this as an example: https://github.com/QubesOS/qubes-core-admin/blob/main/.gitlab-ci.yml. Can you create a separate PR to enable those checks? |
|
Sure, will do |
|
We can squash here if it looks fine. I will also have to rebase, as this is 24 commits behind. |
|
Looks fine to squash. |
71dee9e to
6c3d431
|
Oh no, commit message in the other PR closed it ... |
|
I see a conflict here - CI will not run until it's resolved. |
16b99f5 to
f1f8fc1
ben-grande
left a comment
I have taken a look through the new code. Thanks for the progress, I like to see it evolving. Just a minor review, though.
ad97dca to
a0289a7
|
Can you share more details of what fails on cleanup? |
|
When I tried upgrading Fedora 41 -> 42, there was a cleanup failure that happened after the successful version upgrade. Logs, as shared privately on Tuesday: https://gist.github.com/nihalxkumar/01fb990deab3960cb28d680773bb1089#file-upgrade42-log-L48-L55 we can see https://gist.github.com/nihalxkumar/0e4edccb4f4409b1a4daf5c44b6f576f After
|
|
But we can't ignore all cleanup failures. We don't know what is causing it, so this needs to be investigated. |
a90eee1 to
f66d1e7
|
a90eee1 wasn't ignoring all cleanup failures 😅 Anyway, we finally have a full end-to-end successful Fedora upgrade pipeline. More logs are at https://gist.github.com/nihalxkumar/e02475c11b6dc8a2e855ceec66599d49. For the fix, I had to extend the try block in qubes-core-admin-linux/vmupdate/update_manager.py (lines 362 to 366 in 497d467). I have described the details in f66d1e7's description. |
Add qvm-template-upgrade as an initial safe upgrade workflow for TemplateVMs and StandaloneVMs. The command validates the source qube, derives the next distro-version clone name, clones the source, updates template metadata, and cleans up failed clones unless explicitly asked to keep them. The version-upgrade agent hook remains a stub for now, so the command can land the orchestration, rollback behavior, and tests without pretending to perform in-VM distro upgrades yet. Fixes: QubesOS/qubes-issues#8605
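The "derives the next distro-version clone name" step in the commit message above could look roughly like the following. This is only a sketch: the helper name `next_clone_name`, the regular expression, and the assumption that qube names follow a `<distro>-<release>[-flavor]` pattern are hypothetical and not taken from the PR.

```python
import re

# Hypothetical naming pattern: "<distro>-<release>" with an optional flavor
# suffix, e.g. "fedora-41" or "debian-12-minimal". Assumption, not PR code.
_NAME_RE = re.compile(r"^(?P<prefix>[a-z]+-)(?P<version>\d+)(?P<suffix>.*)$")


def next_clone_name(source_name: str) -> str:
    """Return the clone name for the N+1 release of a versioned qube name."""
    match = _NAME_RE.match(source_name)
    if match is None:
        # Fail early instead of guessing a name for an unversioned qube.
        raise ValueError(f"cannot derive next version from {source_name!r}")
    next_version = int(match["version"]) + 1
    return f"{match['prefix']}{next_version}{match['suffix']}"


if __name__ == "__main__":
    print(next_clone_name("fedora-41"))
    print(next_clone_name("debian-12-minimal"))
```

Rejecting names that carry no release number keeps the orchestrator from cloning a qube it cannot name unambiguously.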
This adds the in-qube side of the distro version upgrade and connects it to the dom0 orchestrator. The package manager grows a version_upgrade entry point whose default fails loud, so families without a real path return an error. The dnf path re-reads the distribution from inside the qube, refuses anything that isn't a single-step move to a RedHat-family release, and runs distro-sync onto the target releasever. A --version-upgrade flag selects it, and value-bearing agent args are skipped when unset so a normal update never injects a bare "None". On dom0, run_agent drives the clone through the existing vmupdate qrexec transport, forwarding the agent's streamed output to the log so the user sees progress. On failure the clone is shut down before deletion. StandaloneVM clones keep their inherited template-* features untouched.
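Two of the behaviours in the commit message above lend themselves to a short illustration: skipping value-bearing agent arguments when they are unset, and refusing anything that is not a single-step release move. The sketch below assumes options are collected in a plain dict; the function names `build_agent_args` and `is_single_step`, the `--hypothetical-option` flag, and the idea that `--version-upgrade` carries the target release as its value are assumptions made for illustration, not the PR's actual code.

```python
from typing import Mapping, Optional, Union

OptionValue = Optional[Union[bool, int, str]]


def build_agent_args(options: Mapping[str, OptionValue]) -> list:
    """Turn an options mapping into argv-style arguments for the agent."""
    args = []
    for flag, value in options.items():
        if value is None or value is False:
            # Unset options are dropped entirely, so the agent never
            # receives a literal "None" as an argument value.
            continue
        if value is True:
            args.append(flag)  # boolean switch, no value
        else:
            args.extend([flag, str(value)])
    return args


def is_single_step(current_release: int, target_release: int) -> bool:
    """Accept only an N -> N+1 move."""
    return target_release == current_release + 1


if __name__ == "__main__":
    opts = {"--version-upgrade": 42, "--hypothetical-option": None}
    print(build_agent_args(opts))
    print(is_single_step(41, 42), is_single_step(41, 43))
```

Filtering at the point where arguments are assembled means a regular update, where the upgrade target is unset, produces exactly the same argument list as before.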
When QubeConnection.__exit__ shut the upgraded clone down it called is_running()/shutdown() outside any error handling. An exception raised there propagated out of _run_agent, was caught by update_qube's broad except, and surfaced as ERR_VM_UNHANDLED (exit code 26), which rolled back and deleted an otherwise successfully upgraded clone. The detail explaining the real cause was buried in result.out, which run_agent discarded, so it never reached the tool log. Wrap the shutdown block in a broad try/except that logs the error instead of raising, mirroring the rm cleanup immediately above it, so a teardown problem can no longer mask a successful upgrade. Also append result.out/result.err to the UpgradeError so the underlying cause is visible in the tool log when the agent genuinely fails.
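The change described in the commit message above can be pictured as a context manager whose teardown logs shutdown errors instead of raising them, plus an error message that carries the agent's output. This is a simplified sketch, not the code in update_manager.py: `SketchConnection`, `agent_failure`, and the message layout are hypothetical, and the qube object is assumed to expose only `is_running()` and `shutdown()`.

```python
import logging

log = logging.getLogger("template-upgrade-sketch")


class UpgradeError(Exception):
    """Raised when the in-qube agent genuinely fails."""


class SketchConnection:
    """Hypothetical stand-in for a connection that shuts the qube down on exit."""

    def __init__(self, qube):
        self.qube = qube

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            if self.qube.is_running():
                self.qube.shutdown()
        except Exception as err:  # broad on purpose: teardown must not raise
            log.error("shutdown of qube failed during teardown: %s", err)
        return False  # never swallow an exception raised inside the block


def agent_failure(message: str, out: str, err: str) -> UpgradeError:
    """Build an UpgradeError that keeps the agent's stdout and stderr visible."""
    return UpgradeError(f"{message}\nstdout:\n{out}\nstderr:\n{err}")
```

Note that later in this conversation the reviewer asks for the shutdown exception to be raised so the command fails, so this tolerance is shown only to illustrate what the commit message describes.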
f66d1e7 to
bd9a723
|
Unsure about this one also. If shutdown failed, the upgrade might have failed despite the package manager not returning an error during upgrade. Which exception were you receiving? |
By "all" I thought we were including the regular-updates cleanup behavior as well, which was untouched. On the version-upgrade path, though, you're right: once the upgrade returned I've since dropped that tolerance change. As in every run |
The exception was an It's raised in the pre-existing To your concern that a failed shutdown could mean a failed upgrade: in this case it didn't. The agent finishes cleanly: My dom0 is on
We could also harden the source instead, |
Can you upgrade to R4.3? It would make testing much easier, because currently you are testing an EOL release.
The exception should be raised so the command fails. It won't be raised when you have R4.3. |


This PR introduces the qvm-template-upgrade dom0 command-line utility, which performs an in-place N -> N+1 distribution upgrade of a Debian or Fedora TemplateVM or StandaloneVM.
fixes: QubesOS/qubes-issues#8605
GSoC 2026 project: Automate Template Version Upgrade