fix(cloud.cfg.tmpl): move install_hotplug to an earlier point in the config#6890

Open
DarkPhily wants to merge 2 commits into canonical:main from hetznercloud:hetzner_move_install_hotplug

Contributor

@DarkPhily DarkPhily commented May 20, 2026

Proposed Commit Message

fix(cloud.cfg.tmpl): move install_hotplug to an earlier point in the config

We encountered a race condition in the handling of hotplug events.
An additional network device can be attached right after the
init-network stage and before config-install_hotplug has finished.
A long-running runcmd script increases the likelihood of this
happening.

That's the reason we propose moving the install_hotplug hook to an
earlier point in the config.

Test Steps

Script to reproduce:

echo '
#cloud-config
runcmd:
- sleep 30
' | hcloud server create --name hotplug-test --type cpx12 --image ubuntu-24.04 --location ash  --user-data-from-file=/dev/stdin 
hcloud server attach-to-network --network hotplug-test hotplug-test

Log output:

# first network configuration stage
2026-05-13 15:53:04,860 - DataSourceHetzner.py[DEBUG]: Using private_networks source: 'http://[fe80::a9fe:a9fe%25enp1s0]/hetzner/v1/metadata/private-networks'

# NIC event from dmesg
[Wed May 13 15:53:10 2026] virtio_net virtio6 enp7s0: renamed from eth1

# hotplug install from cloudinit
2026-05-13 15:53:43,246 - cc_install_hotplug.py[INFO]: Installing hotplug.
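
The size of the race window can be read off the log output above. A small sketch that computes it, using the three timestamps copied from the excerpt with sub-second parts dropped, and assuming the dmesg line is in the same timezone as the cloud-init log:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log excerpt (sub-second parts dropped).
network_stage = datetime.strptime("2026-05-13 15:53:04", FMT)
nic_attached = datetime.strptime("2026-05-13 15:53:10", FMT)
hotplug_installed = datetime.strptime("2026-05-13 15:53:43", FMT)

# The NIC event falls after the network stage but before the hotplug
# hook is installed, so nothing is in place to handle it.
missed = network_stage < nic_attached < hotplug_installed

# Length of the window in which an attach goes unhandled, and how long
# after the NIC event the hook was finally installed.
window_seconds = (hotplug_installed - network_stage).total_seconds()
late_by_seconds = (hotplug_installed - nic_attached).total_seconds()

print(f"missed={missed} window={window_seconds:.0f}s late_by={late_by_seconds:.0f}s")
```

Most of that window is the `sleep 30` from the reproduction script, which is why a long-running runcmd makes the race easier to hit.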

Merge type

  • Squash merge using "Proposed Commit Message"

Collaborator

@blackboxsw blackboxsw left a comment

Thank you for this submission @DarkPhily.

  1. The move of install_hotplug to just after disk_setup feels arbitrary. Is there a reason for running it there instead of making it the first module run in the cloud_init_modules section (just before seed_random)?

  2. One thing we need to be wary of is that hotplug events end up triggering a Datasource pickle via _write_to_cache. The pickled datasource is read in every boot stage of cloud-init except the modules:final stage; it is also written in the earlier cloud-init-local and cloud-init 'network' stages.

root@testr:~# egrep 'Cloud-init|obj.pkl' /var/log/cloud-init.log 
2026-05-27 14:26:06,086 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'init-local' at Wed, 27 May 2026 14:26:06 +0000. Up 0.65 seconds.
2026-05-27 14:26:06,089 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,140 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 6812 bytes
2026-05-27 14:26:06,237 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'init' at Wed, 27 May 2026 14:26:06 +0000. Up 0.81 seconds.
2026-05-27 14:26:06,249 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,249 - util.py[DEBUG]: Reading 6812 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:06,267 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 6920 bytes
2026-05-27 14:26:06,291 - util.py[DEBUG]: Writing to /var/lib/cloud/instance/obj.pkl - wb: [400] 9281 bytes
2026-05-27 14:26:06,663 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:06,663 - util.py[DEBUG]: Reading 9281 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:06,672 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'modules:config' at Wed, 27 May 2026 14:26:06 +0000. Up 1.23 seconds.
2026-05-27 14:26:07,188 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2026-05-27 14:26:07,188 - util.py[DEBUG]: Reading 9281 bytes from /var/lib/cloud/instance/obj.pkl
2026-05-27 14:26:07,200 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 running 'modules:final' at Wed, 27 May 2026 14:26:07 +0000. Up 1.76 seconds.
2026-05-27 14:26:07,279 - log_util.py[DEBUG]: Cloud-init v. 26.1-0ubuntu2 finished at Wed, 27 May 2026 14:26:07 +0000. Datasource DataSourceLXD.  Up 1.86 seconds

So, we need to be careful when moving install_hotplug earlier in boot. If a hotplug event tries to write obj.pkl while one of the cloud-init boot stages is reading it, that boot stage could run into partial-read errors.
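
To illustrate the failure mode: a reader that sees only part of a pickle cannot load it. A minimal sketch, where a plain dict stands in for the cached datasource (the real obj.pkl holds a DataSource object) and `try_load` is a helper introduced here for illustration:

```python
import pickle


def try_load(data: bytes) -> bool:
    """Return True if data unpickles cleanly, False if it is incomplete."""
    try:
        pickle.loads(data)
    except (pickle.UnpicklingError, EOFError):
        return False
    return True


# Stand-in payload; not the actual contents of obj.pkl.
full = pickle.dumps(
    {"network_config": {"version": 2, "ethernets": {}}, "seq": list(range(100))}
)

# What a reader might see if the writer has only flushed half the file.
partial = full[: len(full) // 2]

print(try_load(full), try_load(partial))
```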

2A. I think we should reduce the likelihood of hitting partial read/write issues by using atomic_helper.write_file inside the _write_to_cache function. Please make this change in this PR and add unit tests asserting that _write_to_cache uses atomic_helper.write_file instead of util.write_file.
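
For context, the general idea behind an atomic write is to write to a temporary file in the same directory and then rename it over the destination, so a concurrent reader sees either the old file or the new one, never a partial one. A generic sketch of that pattern; `atomic_write` is a hypothetical helper for illustration and is not cloud-init's atomic_helper.write_file implementation:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes, mode: int = 0o400) -> None:
    """Write data to path via a same-directory temp file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target so
    # that the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, mode)
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The default mode of 0o400 mirrors the `[400]` shown in the log lines above; everything else is an assumption made for the sketch.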

2B. I don't think we need to consider mitigating split-brain or lost updates due to a simultaneous init-network stage and hotplug event, because "activate-datasource", which writes to obj.pkl in the cloud-init local and network stages, happens before any of the individual config modules in cloud_init_modules start running.

  3. Additionally, I'd like to capture the full logs of a successful hotplug run. Please attach the output of cloud-init collect-logs to this pull request once we at least have the atomic write in place, with unit tests covering the atomic write_file operation triggered by _write_to_cache, so we can see the impact of such a hotplug event on a live system.

For this PR I'd like to see 1, 2A and 3 addressed to reduce the risk of shifting this functionality earlier in boot. I don't think we need to address 2B for this proposed ordering, because the config modules defined in the cloud_init_modules section run in the cloud-init network stage only after 'activate datasource' is called, which is the last operation (outside of hotplug) that writes to obj.pkl.

Comment thread config/cloud.cfg.tmpl
{% endif %}
{% if not is_bsd %}
- disk_setup
- install_hotplug
Collaborator

This PR moves install_hotplug into a section that is excluded on BSD systems. This is fine because cc_install_hotplug already no-ops when udevadm is absent, which is the case on BSD systems. However, this should be documented in the proposed commit message.

Collaborator

If the PR moves this to before seed_random, let's still ensure it sits under an {% if not is_bsd %} template conditional so it stays excluded on *BSD. I agree this shouldn't be relevant there and will no-op anyway.

@blackboxsw blackboxsw self-assigned this May 27, 2026
@DarkPhily
Contributor Author

  1. We weren't sure if a datasource needs to be present at the point of loading the module.
def handle(name: str, cfg: Config, cloud: Cloud, args: list) -> None:
    network_hotplug_enabled = (
        "updates" in cfg
        and "network" in cfg["updates"]
        and "when" in cfg["updates"]["network"]
        and "hotplug" in cfg["updates"]["network"]["when"]
    )
    install_hotplug(cloud.datasource, cfg, network_hotplug_enabled)

Because handle passes cloud.datasource to install_hotplug, we assumed that a datasource is necessary at that point. If this assumption isn't correct, I would appreciate it if you could shed some light on the matter. I would be more than happy to move the module even before seed_random to eliminate any possibility of a race condition.
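
As a side note on the quoted handle function: the enablement check itself depends only on cfg, while the datasource is passed through to install_hotplug separately. A small sketch of the same check as a standalone function, assuming a well-formed config where each level is a dict; `hotplug_enabled` is a name introduced here for illustration and is not part of cc_install_hotplug:

```python
def hotplug_enabled(cfg: dict) -> bool:
    """Return True if cfg["updates"]["network"]["when"] contains "hotplug"."""
    when = cfg.get("updates", {}).get("network", {}).get("when", [])
    return "hotplug" in when


# Config equivalent to user-data that opts in to hotplug network updates.
opted_in = {"updates": {"network": {"when": ["boot", "hotplug"]}}}
boot_only = {"updates": {"network": {"when": ["boot"]}}}

print(hotplug_enabled(opted_in), hotplug_enabled(boot_only), hotplug_enabled({}))
```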

I will start working on the other points for now.
Thanks for your feedback.

@DarkPhily
Contributor Author

I tried to build the Debian package with packages/bddeb -d, but it is failing again.
It appears that the same bug as described in #6141 and fixed in https://github.com/canonical/cloud-init/pull/6464/changes was somehow reintroduced.

I'm running the build command on our cloud server as root, and it is again overwriting my /etc/netplan/50-cloud-init.yml

The failed tests are:

        =========================== short test summary info ============================
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v1_to_netplan_ub
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v1_ipv6_to_netplan_ub
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v2_passthrough_ub
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v2_passthrough_retain_orig_perms
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v2_passthrough_ub_old_behavior
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroUbuntuNetplan::test_apply_network_config_v2_full_passthrough_ub
        FAILED tests/unittests/distros/test_netconfig.py::TestNetCfgDistroArch::test_apply_network_config_v1_with_netplan
        FAILED tests/unittests/net/test_net_rendering.py::test_convert[no_matching_mac_v2-Renderer.Netplan|NetworkManager]
        = 8 failed, 5638 passed, 9 skipped, 13 xfailed, 2 xpassed, 84 warnings in 101.18s (0:01:41) =

@DarkPhily
Contributor Author

These are the requested logs:
cloud-init.tar.gz

@DarkPhily DarkPhily requested a review from blackboxsw June 1, 2026 06:46