fix(cloud.cfg.tmpl): move install_hotplug to an earlier point in the config#6890
fix(cloud.cfg.tmpl): move install_hotplug to an earlier point in the config#6890DarkPhily wants to merge 2 commits into
Conversation
blackboxsw
left a comment
There was a problem hiding this comment.
Thank you for this submission @DarkPhily.
-
The move of install_hotplug to just after disk_setup feels arbitrary. Any reason behind choosing it to run there instead of being the first module run on cloud_init_modules section (just before seed_random)?
-
One thing we need to be wary of is that hotplug events end up triggering a Datasource pickle via _write_to_cache. The pickled datasource is read in every boot stage of cloud-init except the modules:final stage, it is also written in the earlier cloud-init-local and cloud-init-'network' stages.
root@testr:~# egrep 'Cloud-init|obj.pkl' /var/log/cloud-init.log
2026-05-27 14:26:06,086 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'init-local' at Wed, 27 May 2026 14:26:06 +0000. Up 0.65 seconds.
2026-05-27 14:26:06,089 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,140 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 6812 bytes
2026-05-27 14:26:06,237 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'init' at Wed, 27 May 2026 14:26:06 +0000. Up 0.81 seconds.
2026-05-27 14:26:06,249 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,249 - util.py[DEBUG]: Reading 6812 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:06,267 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 6920 bytes
2026-05-27 14:26:06,291 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 9281 bytes
2026-05-27 14:26:06,663 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,663 - util.py[DEBUG]: Reading 9281 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:06,672 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'modules:config' at Wed, 27 May 2026 14:26:06 +0000. Up 1.23 seconds.
2026-05-27 14:26:07,188 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:07,188 - util.py[DEBUG]: Reading 9281 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:07,200 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'modules:final' at Wed, 27 May 2026 14:26:07 +0000. Up 1.76 seconds.
2026-05-27 14:26:07,279 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 finished at Wed, 27 May 2026 14:26:07 +0000. Datasource DataSourceLXD. Up 1.86 seconds
So, we need to be careful when moving install_hotplug earlier in boot. If a hotplug event happens while one of those reads are being performed by the cloud-init boot stages we could run into partial read errors while a boot stage is trying to read the obj.pkl if the hotplug event is trying to write it at the same time.
2A. I think we may want to ensure we reduce likelihood of hitting partial read/write issues with either the use of atomic_helper.write_file inside the _write_to_cache function. Please make this change in this PR and add unittests to assert that _write_to_cache properly uses atomic_helper.write_file instead of util.write_file.
2B: I don't think we need to consider trying to mitigate split-brain or lost updates due to simultaneous init-network stage and hotplug because "activate-datasource" which writes to obj.pkl in cloud-init local and network stage happens before any of the individual config modules in cloud_init_modules start running.
- Additionally, I'd like to see if we can capture the full logs of a successful hotplug run by attaching
cloud-init collect-logsto this pull request once we minimally have atomic_write in place and unittest coveraging the atomic_write_file operation triggered by _write_to_cache so we can see the impact of such a hotplug event on a live system.
For this PR I'd like to see 1, 2A and 3 addressed to reduce our risks of shifting this functionality earlier in boot. I don't think we need to address 2B for this proposed ordering, because the config modules defined in cloud_init_modules section runs in cloud-init network stage only after 'activate datasource' is called which is the last operation (outside of hotplug) which writes to obj,pkl.
| {% endif %} | ||
| {% if not is_bsd %} | ||
| - disk_setup | ||
| - install_hotplug |
There was a problem hiding this comment.
This Pr moves install_hotplug into a section that is now excluded from BSD systems. This is fine because cc_install_hotplug already NOOPs on absence of udevadm which BSD systems will not have present. But, this should be documented in the proposed commit message.
There was a problem hiding this comment.
If the PR moves this to before seed_random let's still ensure it is excluded below an {% if not is_bsd %} template conditional as I agree this shouldn't be relevant on *BSD and will noop anyway.
We assumed that handle is called and therefore a datasource is necessary. If this assumption isn't correct I would be happy for you to shed some light on that matter. I would be more than happy to move it even before I will start working on the other points for now. |
|
I try to build the debian package with I'm running the build command on our cloud-server as root and it's again overwriting my The failed tests are: |
|
These are the requested logs: |
Proposed Commit Message
Test Steps
Script to reproduce:
Log output:
Merge type