
qvm-template-upgrade: add orchestration workflow and in-VM agent#213

Open
nihalxkumar wants to merge 3 commits into
QubesOS:mainfrom
nihalxkumar:qvm-template-upgrade


@nihalxkumar nihalxkumar commented May 25, 2026

Contributor

This PR introduces the qvm-template-upgrade dom0 command-line utility, which performs an in-place N -> N+1 distribution upgrade of a Debian or Fedora TemplateVM or StandaloneVM.

fixes: QubesOS/qubes-issues#8605
GSoC 2026 project: Automate Template Version Upgrade

@ben-grande ben-grande left a comment

Contributor


Yay, started. As I've done a review now, I will receive a notification every time you commit. Let me know when you need another look or have doubts by mentioning me.

Comment thread vmupdate/tests/test_template_upgrade.py
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
@codecov-commenter

codecov-commenter commented Jun 1, 2026


Codecov Report

❌ Patch coverage is 99.19614% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.57%. Comparing base (497d467) to head (bd9a723).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vmupdate/agent/source/common/package_manager.py | 69.23% | 4 Missing ⚠️ |
| vmupdate/agent/source/dnf/dnf_cli.py | 97.50% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   71.72%   69.57%   -2.16%     
==========================================
  Files          12       28      +16     
  Lines        1337     2485    +1148     
==========================================
+ Hits          959     1729     +770     
- Misses        378      756     +378     


@nihalxkumar nihalxkumar marked this pull request as ready for review June 2, 2026 05:29
@ben-grande

Contributor

PipelineRetryFailed

@nihalxkumar

Contributor Author

It's showing a successful run


@ben-grande

Contributor

Some tests are not enabled on this repo, such as mypy, black and pylint. See this as an example: https://github.com/QubesOS/qubes-core-admin/blob/main/.gitlab-ci.yml. Can you create a separate PR to enable those checks?

@nihalxkumar

Contributor Author

Sure, will do

Comment thread vmupdate/template_upgrade.py
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
Comment thread vmupdate/template_upgrade.py Outdated
@nihalxkumar

Contributor Author

We can squash here if it looks fine. I will also have to rebase, as this is 24 commits behind.

@ben-grande

Contributor

Looks fine to squash.

@marmarek

Member

Oh no, commit message in the other PR closed it ...

@marmarek marmarek reopened this Jun 10, 2026
@marmarek

Member

I see a conflict here - CI will not run until it's resolved.

@nihalxkumar nihalxkumar force-pushed the qvm-template-upgrade branch 6 times, most recently from 16b99f5 to f1f8fc1 Compare June 14, 2026 16:29
@nihalxkumar nihalxkumar changed the title qvm-template-upgrade: add CLI skeleton and orchestration flow qvm-template-upgrade: add orchestration workflow and in-VM agent Jun 14, 2026

@ben-grande ben-grande left a comment

Contributor


I have taken a look through the new code. Thanks for the progress, I like to see it evolving. Just a minor review, though.

Comment thread vmupdate/agent/source/common/package_manager.py
Comment thread vmupdate/agent/source/common/package_manager.py Outdated
Comment thread vmupdate/agent/source/dnf/dnf_cli.py Outdated
Comment thread vmupdate/agent/source/dnf/dnf_cli.py Outdated
@nihalxkumar nihalxkumar force-pushed the qvm-template-upgrade branch from ad97dca to a0289a7 Compare June 15, 2026 17:11
@ben-grande

Contributor

@nihalxkumar

Contributor Author

When I tried upgrading Fedora 41 -> 42, there was a cleanup failure that happened after the successful version upgrade.

Logs, as shared privately on Tuesday:

https://gist.github.com/nihalxkumar/01fb990deab3960cb28d680773bb1089#file-upgrade42-log-L48-L55

We can see "Complete!" in the log above.

https://gist.github.com/nihalxkumar/0e4edccb4f4409b1a4daf5c44b6f576f

After version_upgrade() returns success, the agent still runs cleanup. The old code used the cleanup exit code when deciding the final agent exit code.
For dnf this includes DNFCLI.clean() (dnf clean packages), and the transport also removes /run/qubes-update/.

a90eee1 (last commit) is based on that failure boundary. After the release transaction has succeeded, a later cleanup failure should not cause rollback of the upgraded clone.

@ben-grande

Contributor

But we can't ignore all cleanup failures. We don't know what is causing it, so this needs to be investigated.

@nihalxkumar nihalxkumar force-pushed the qvm-template-upgrade branch from a90eee1 to f66d1e7 Compare June 26, 2026 18:55
@nihalxkumar

Contributor Author

a90eee1 wasn't ignoring all cleanup failures 😅 Anyway, we finally have a fully successful end-to-end Fedora upgrade pipeline:

fedora-42-xfce:err: [3662/3664] Removing fedora-release-ide 100% |  21.0   B/s |   1.0   B |  00m00s
fedora-42-xfce:err: [3663/3664] Removing ncurses-base-0:6.5 100% |   2.6 KiB/s | 179.0   B |  00m00s
fedora-42-xfce:err: warning: posix.fork(): .fork(), .exec(), .wait() and .redirect2null() are deprecated, use rpm.spawn() or rpm.execute() instead
fedora-42-xfce:err: warning: posix.wait(): .fork(), .exec(), .wait() and .redirect2null() are deprecated, use rpm.spawn() or rpm.execute() instead
fedora-42-xfce:err: warning: posix.exec(): .fork(), .exec(), .wait() and .redirect2null() are deprecated, use rpm.spawn() or rpm.execute() instead
fedora-42-xfce:out: /etc/selinux/targeted/contexts/files/file_contexts.bin:  Old compiled fcontext format, skipping
fedora-42-xfce:out: /etc/selinux/targeted/contexts/files/file_contexts.homedirs.bin:  Old compiled fcontext format, skipping
fedora-42-xfce:out: /etc/selinux/targeted/contexts/files/file_contexts.bin:  Old compiled fcontext format, skipping
fedora-42-xfce:out: /etc/selinux/targeted/contexts/files/file_contexts.homedirs.bin:  Old compiled fcontext format, skipping
fedora-42-xfce:out: /usr/sbin cannot be merged yet, found /usr/sbin/capsh
fedora-42-xfce:out: All files under /usr/local/sbin are symlinks; linking to ./bin...
fedora-42-xfce:out: ...done
fedora-42-xfce:out: /usr/sbin cannot be merged yet, found /usr/sbin/capsh
fedora-42-xfce:err: [3664/3664] Removing libgcc-0:14.3.1-4. 100% |   0.0   B/s |  11.0   B |  25m26s
fedora-42-xfce:err: >>> Running %posttrans scriptlet: qubes-core-agent-selinux-0:4.2.47-1.fc42.noarc
fedora-42-xfce:err: >>> Finished %posttrans scriptlet: qubes-core-agent-selinux-0:4.2.47-1.fc42.noar
fedora-42-xfce:err: >>> [RPM] libselinux: type 0: /etc/selinux/targeted/contexts/files/file_contexts
fedora-42-xfce:err: >>> [RPM] libselinux: type 0: /etc/selinux/targeted/contexts/files/file_contexts
fedora-42-xfce:err: >>> Running %posttrans scriptlet: kernel-core-0:6.19.14-108.fc42.x86_64         
fedora-42-xfce:err: >>> Finished %posttrans scriptlet: kernel-core-0:6.19.14-108.fc42.x86_64        
fedora-42-xfce:err: >>> Scriptlet output:                                                           
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-cryptsetup' depends on module 'crypt', which can'
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-pcrphase' depends on module 'tpm2-tss', which can
fedora-42-xfce:err: >>>                                                                             
fedora-42-xfce:err: >>> Running %posttrans scriptlet: qubes-kernel-vm-support-0:4.2.22-1.fc42.x86_64
fedora-42-xfce:err: >>> Finished %posttrans scriptlet: qubes-kernel-vm-support-0:4.2.22-1.fc42.x86_6
fedora-42-xfce:err: >>> Scriptlet output:                                                           
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-cryptsetup' depends on module 'crypt', which can'
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-pcrphase' depends on module 'tpm2-tss', which can
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-cryptsetup' depends on module 'crypt', which can'
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-pcrphase' depends on module 'tpm2-tss', which can
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-cryptsetup' depends on module 'crypt', which can'
fedora-42-xfce:err: >>> dracut[E]: Module 'systemd-pcrphase' depends on module 'tpm2-tss', which can
fedora-42-xfce:err: >>>                                                                             
fedora-42-xfce:err: >>> Running %triggerin scriptlet: systemd-0:257.13-1.fc42.x86_64                
fedora-42-xfce:err: >>> Finished %triggerin scriptlet: systemd-0:257.13-1.fc42.x86_64               
fedora-42-xfce:err: >>> Scriptlet output:                                                           
fedora-42-xfce:err: >>>                                                                             
fedora-42-xfce:err: Complete!
Updating metadata on fedora-42-xfce
Upgrade complete. New template: fedora-42-xfce
Original qube fedora-41-xfce is untouched.
[nxk@dom0 ~]$ 

More logs are at https://gist.github.com/nihalxkumar/e02475c11b6dc8a2e855ceec66599d49

For the fix, I had to extend the try block in

    except Exception as exc:  # pylint: disable=broad-except
        status_notifier.put(StatusInfo.done(qube, FinalStatus.ERROR))
        return qube.name, ProcessResult(
            EXIT.ERR_VM_UNHANDLED, f"ERROR (exception {str(exc)})"
        )

I have described the details in f66d1e7's description.

Add qvm-template-upgrade as an initial safe upgrade workflow for
TemplateVMs and StandaloneVMs. The command validates the source qube,
derives the next distro-version clone name, clones the source, updates
template metadata, and cleans up failed clones unless explicitly asked
to keep them.
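
The name-derivation step can be pictured with a minimal sketch. The helper name `next_clone_name` and the regular expression are hypothetical and not taken from the PR; the sketch assumes the release number is the first hyphen-delimited integer in the qube name, as in fedora-41-xfce.

```python
import re


def next_clone_name(source_name: str) -> str:
    """Return source_name with its release number increased by one.

    Hypothetical helper for illustration only; the PR's real logic may differ.
    """
    # Find the first run of digits that is a whole hyphen-delimited field,
    # e.g. the "41" in "fedora-41-xfce".
    match = re.search(r"(?<=-)\d+(?=-|$)", source_name)
    if match is None:
        raise ValueError(f"no release number found in {source_name!r}")
    bumped = str(int(match.group()) + 1)
    return source_name[: match.start()] + bumped + source_name[match.end():]


print(next_clone_name("fedora-41-xfce"))  # fedora-42-xfce
```

Raising on names without a release number, rather than guessing, keeps the sketch in line with the "validates the source qube" step described above.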

The version-upgrade agent hook remains a stub for now, so the command
can land the orchestration, rollback behavior, and tests without
pretending to perform in-VM distro upgrades yet.

Fixes: QubesOS/qubes-issues#8605

This adds the in-qube side of the distro version upgrade and connects it
to the dom0 orchestrator.

The package manager grows a version_upgrade entry point whose default
fails loud, so families without a real path return an error. The dnf
path re-reads the distribution from inside the qube, refuses anything
that isn't a single-step move to a RedHat-family release, and runs
distro-sync onto the target releasever.  A --version-upgrade flag
selects it, and value-bearing agent args are skipped when unset so a
normal update never injects a bare "None".
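
A rough sketch of the two ideas in this paragraph, under stated assumptions: the function names and the option names in the usage lines are hypothetical, and the real agent reads the release from inside the qube rather than taking it as an argument. The dnf arguments are the ones that appear in the agent log quoted later in this thread.

```python
from typing import Dict, List, Optional


def build_distro_sync_command(current: int, target: int) -> List[str]:
    """Build the dnf command for a single-step release upgrade.

    Hypothetical helper: refuses anything other than current -> current + 1.
    """
    if target != current + 1:
        raise ValueError(
            f"only single-step upgrades are supported: {current} -> {target}"
        )
    return [
        "dnf",
        f"--releasever={target}",
        "distro-sync",
        "--best",
        "--allowerasing",
        "--assumeyes",
    ]


def build_agent_args(options: Dict[str, Optional[object]]) -> List[str]:
    """Turn option/value pairs into argv, skipping options that are unset."""
    args: List[str] = []
    for flag, value in options.items():
        if value is None:
            # Skipping avoids passing the literal string "None" to the agent.
            continue
        args.extend([flag, str(value)])
    return args


print(build_distro_sync_command(41, 42))
print(build_agent_args({"--some-option": None, "--other-option": 42}))
```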

On dom0, run_agent drives the clone through the existing vmupdate qrexec
transport, forwarding the agent's streamed output to the log so the user
sees progress. On failure the clone is shut down before deletion.
StandaloneVM clones keep their inherited template-* features untouched.

When QubeConnection.__exit__ shut the upgraded clone down it called
is_running()/shutdown() outside any error handling. An exception raised
there propagated out of _run_agent, was caught by update_qube's broad
except, and surfaced as ERR_VM_UNHANDLED (exit code 26), which rolled
back and deleted an otherwise successfully upgraded clone. The detail
explaining the real cause was buried in result.out, which run_agent
discarded, so it never reached the tool log.

Wrap the shutdown block in a broad try/except that logs the error
instead of raising, mirroring the rm cleanup immediately above it, so a
teardown problem can no longer mask a successful upgrade. Also append
result.out/result.err to the UpgradeError so the underlying cause is
visible in the tool log when the agent genuinely fails.
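
The teardown pattern described in this commit message might look roughly like the sketch below. `FakeQube` and `ConnectionSketch` are stand-ins invented for illustration; they are not the real QubeConnection or the Qubes admin API. Whether swallowing the error is the right policy is exactly what is debated in the review comments that follow.

```python
import logging

log = logging.getLogger("teardown-sketch")


class FakeQube:
    """Stand-in for a qube object; not the real Qubes admin API."""

    def __init__(self, name: str, fail_shutdown: bool = False) -> None:
        self.name = name
        self.fail_shutdown = fail_shutdown
        self.running = True

    def is_running(self) -> bool:
        return self.running

    def shutdown(self) -> None:
        if self.fail_shutdown:
            raise AttributeError("simulated teardown failure")
        self.running = False


class ConnectionSketch:
    """Context manager whose teardown logs shutdown errors instead of raising."""

    def __init__(self, qube: FakeQube) -> None:
        self.qube = qube
        self.teardown_error = None

    def __enter__(self) -> "ConnectionSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        try:
            if self.qube.is_running():
                self.qube.shutdown()
        except Exception as err:  # pylint: disable=broad-except
            # Record and log the failure so it cannot replace the agent result.
            self.teardown_error = err
            log.error(
                "Cannot shutdown %s, because of error: %s", self.qube.name, err
            )
        # Returning False lets exceptions from the with-body propagate as usual.
        return False
```

Only errors raised during teardown are absorbed; an exception raised inside the `with` body still reaches the caller.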
@nihalxkumar nihalxkumar force-pushed the qvm-template-upgrade branch from f66d1e7 to bd9a723 Compare June 26, 2026 19:07
@ben-grande

Contributor

> a90eee1 wasn't ignoring all cleanup failures 😅

a90eee1

How not? It was ignoring all cleanup failures if pkg_mng.version_upgrade returned EXIT.OK. Can you explain in more details? I didn't check if this change remained in the latest commits.

@ben-grande

Contributor

f66d1e7

Unsure about this one also. If shutdown failed, the upgrade might have failed despite the package manager not returning an error during upgrade. Which exception were you receiving?

@nihalxkumar

Contributor Author

> > a90eee1 wasn't ignoring all cleanup failures 😅
>
> a90eee1
>
> How not? It was ignoring all cleanup failures if pkg_mng.version_upgrade returned EXIT.OK. Can you explain in more details? I didn't check if this change remained in the latest commits.

By "all" I thought we were including the regular-update cleanup behavior as well, which was untouched. On the version-upgrade path, though, you're right: once the upgrade returned EXIT.OK, a90eee1 suppressed any non-zero clean() code. My reasoning was that a failing clean() after a completed release transaction is just a cache issue, but because the dom0 tool rolls back on any agent exit != EXIT.OK (template_upgrade.py#L307), a fatal cleanup code would delete a good upgrade. So I tolerated it only on the success path and kept it fatal for normal updates.

I've since dropped that tolerance change. In every run dnf clean actually returned 0, so the tolerance was never triggered, and the rollbacks I was chasing were exit-26 (ERR_VM_UNHANDLED), whereas a cleanup failure can only ever be exit-25 (ERR_VM_CLEANUP) (dnf_cli.py#L145-L151), which is already in VM_HANDLED (exit_codes.py#L34-L40) and never escalates to 26. So cleanup wasn't the cause. The real source of exit-26 was the teardown, which bd9a723 addresses.
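
The exit-code reasoning above can be sketched as follows. The values 25 and 26 are the ones quoted in this comment; `OK = 0`, the reduced `VM_HANDLED` set, and the function names `normalize` and `should_roll_back` are assumptions made for illustration and may not match exit_codes.py.

```python
from enum import IntEnum


class Exit(IntEnum):
    OK = 0                 # assumption: success is zero
    ERR_VM_CLEANUP = 25    # value quoted in the comment above
    ERR_VM_UNHANDLED = 26  # value quoted in the comment above


# Assumption: the real VM_HANDLED set has more members; only the one
# discussed here is listed.
VM_HANDLED = frozenset({Exit.ERR_VM_CLEANUP})


def normalize(code: int) -> Exit:
    """Pass success and handled codes through; anything else is unhandled."""
    if code == Exit.OK:
        return Exit.OK
    if code in VM_HANDLED:
        return Exit(code)
    return Exit.ERR_VM_UNHANDLED


def should_roll_back(code: int) -> bool:
    """The dom0 tool is described as rolling back on any exit != OK."""
    return normalize(code) != Exit.OK


print(normalize(25).name, normalize(99).name, should_roll_back(0))
```

Under these assumptions a cleanup failure stays at 25 and still triggers a rollback, but it can never be reported as 26, which matches the argument that the exit-26 rollbacks had a different source.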

@nihalxkumar

Contributor Author

> f66d1e7
>
> Unsure about this one also. If shutdown failed, the upgrade might have failed despite the package manager not returning an error during upgrade. Which exception were you receiving?

The exception was an AttributeError:

2026-06-26 07:52:44,585 Cannot shutdown fedora-42-xfce, because of error: 'DeviceCollection' object has no attribute 'get_assigned_devices'

It's raised in the pre-existing _has_assigned_pci_devices helper (qube_connection.py#L109-L112, added in 12e9e32 3 months ago) while __exit__ shuts down the clone.

To your concern that a failed shutdown could mean a failed upgrade: in this case it didn't. The agent finishes cleanly (distro-sync exit 0, dnf clean packages exit 0), and the failure only appears afterwards, at shutdown (logs shared earlier):

2026-06-26 05:40:02,642 [Agent] run command: dnf clean all
2026-06-26 05:40:04,387 [Agent] command exit code: 0
2026-06-26 05:40:04,391 [Agent] run command: dnf --releasever=42 distro-sync --best --allowerasing --assumeyes
2026-06-26 07:52:09,778 [Agent] command exit code: 0
2026-06-26 07:52:09,798 [Agent] version-upgrade out: Removed 24 files, 11 directories (total of 87 MiB). 0 errors occurred.
2026-06-26 07:52:09,808 [Agent] version-upgrade out: 
2026-06-26 07:52:09,817 [Agent] Notify dom0 about upgrades.
2026-06-26 07:52:38,827 [Agent] run command: dnf clean packages
2026-06-26 07:52:39,630 [Agent] command exit code: 0
2026-06-26 07:52:43,427 Remove /run/qubes-update/
2026-06-26 07:52:43,427 run command in fedora-42-xfce: rm -r /run/qubes-update/
2026-06-26 07:52:43,428 Wait for output
2026-06-26 07:52:44,585 Cannot shutdown fedora-42-xfce, because of error: 'DeviceCollection' object has no attribute 'get_assigned_devices'
2026-06-26 07:52:44,586 agent output: 
2026-06-26 07:52:44,586 agent output: 
2026-06-26 07:52:44,587 agent exit code: 0

My dom0 is on qubes-core-admin-client-4.2.18, whose DeviceCollection still exposes assignments()/attached() and not get_assigned_devices(). The latter is the R4.3/main device API that this branch targets, so on an actual R4.3 dom0 this specific AttributeError shouldn't fire.

bd9a723 logs the shutdown error instead of letting a teardown exception propagate out of _run_agent, get caught by update_qube's broad except, and surface as ERR_VM_UNHANDLED which then rolls back and deletes a clone whose upgrade already completed. It also appends the agent's out/err to the UpgradeError, so the real cause reaches the tool log next time instead of being discarded.

We could also harden the source instead: make _has_assigned_pci_devices fall back across both device APIs (or catch AttributeError there) so that the shutdown path itself is robust.
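
A minimal sketch of that fallback idea, with loud caveats: the two stand-in classes below only mimic the method names mentioned above (`get_assigned_devices` and `assignments`), both are called without arguments, and the real DeviceCollection signatures and return types may differ. `assigned_devices` is a hypothetical helper, not code from this branch.

```python
from typing import Any, List


def assigned_devices(collection: Any) -> List[Any]:
    """Return assigned devices via whichever method the collection exposes."""
    # Prefer the newer method name, then fall back to the older one.
    for method_name in ("get_assigned_devices", "assignments"):
        method = getattr(collection, method_name, None)
        if callable(method):
            return list(method())
    raise AttributeError(
        "device collection exposes neither get_assigned_devices() "
        "nor assignments()"
    )


class NewStyleCollection:
    """Stand-in exposing only the newer method name."""

    def get_assigned_devices(self):
        return iter(["example-device"])


class OldStyleCollection:
    """Stand-in exposing only the older method name."""

    def assignments(self):
        return iter([])


print(bool(assigned_devices(NewStyleCollection())))  # True
print(bool(assigned_devices(OldStyleCollection())))  # False
```

Raising when neither method exists keeps a genuinely unexpected API visible instead of silently treating it as "no devices assigned".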

@ben-grande

Contributor

> My dom0 is on qubes-core-admin-client-4.2.18, whose DeviceCollection still exposes assignments()/attached() and not get_assigned_devices(). The latter is the R4.3/main device API that this branch targets, so on an actual R4.3 dom0 this specific AttributeError shouldn't fire.

Can you upgrade to R4.3? It would make testing much easier, because currently you are testing an EOL release.

> bd9a723 logs the shutdown error instead of letting a teardown exception propagate out of _run_agent, get caught by update_qube's broad except, and surface as ERR_VM_UNHANDLED which then rolls back and deletes a clone whose upgrade already completed. It also appends the agent's out/err to the UpgradeError, so the real cause reaches the tool log next time instead of being discarded.

The exception should be raised so the command fails. It won't be raised when you have R4.3.


Development

Successfully merging this pull request may close these issues.

[Contribution] qvm-upgrade-template (easy in-place upgrades for Debian and Fedora templates)

4 participants