machine-config-daemon-firstboot: disable ostree fsync during bootstrap#6122
machine-config-daemon-firstboot: disable ostree fsync during bootstrap#6122sdodson wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
WalkthroughAdds transient ExecStartPre steps to copy and modify the ostree repo config into a tmpfs, bind-mount it over /sysroot/ostree/repo/config to set core.fsync=false for the unit lifetime, runs existing podman firstboot commands, and unmounts the bind in ExecStopPost. ChangesOStree fsync toggle during firstboot
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
| # Run this via podman because we want to use the nmstatectl binary in our container | ||
| ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics | ||
| ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig | ||
| ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true |
There was a problem hiding this comment.
This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.
One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdodson The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@templates/common/_base/units/machine-config-daemon-firstboot.service.yaml`:
- Around line 19-23: The ExecStartPre/ExecStopPost pair in
machine-config-daemon-firstboot.service.yaml currently disables core.fsync and
only re-enables it on clean stop, so add an idempotent recovery that always
ensures ostree core.fsync is true on subsequent boots: implement a small
follow-up systemd unit (or add to the existing machine-config-daemon.service)
that runs ostree config --repo=/sysroot/ostree/repo set core.fsync true
unconditionally at boot (or checks and sets it if false), enable it with
appropriate After= and WantedBy= so it always runs on startup, and keep the
original ExecStartPre/ExecStopPost as-is for the performance window.
- Line 19: The ExecStartPre that disables ostree core.fsync is currently not
marked as best-effort and will fail the unit if it errors; update the
machine-config-daemon-firstboot.service template so the ExecStartPre that runs
"/usr/bin/ostree ... set core.fsync false" is prefixed with "-" to make it
best-effort (so failures don't block firstboot), and also add a preceding
best-effort ExecStartPre that attempts to reset core.fsync to true
("/usr/bin/ostree ... set core.fsync true") to guard against previous runs
leaving fsync disabled; reference the ExecStartPre lines manipulating
"core.fsync" in the service unit and ensure both are best-effort so they won't
cause unit failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e1649fad-a0af-48f6-b2ce-d0cf41d462d6
📒 Files selected for processing (1)
templates/common/_base/units/machine-config-daemon-firstboot.service.yaml
jlebon
left a comment
There was a problem hiding this comment.
Ideally, the way we'd add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync which skips it only when committing layers.
| # Run this via podman because we want to use the nmstatectl binary in our container | ||
| ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics | ||
| ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig | ||
| ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true |
There was a problem hiding this comment.
This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.
One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.
I can pursue the bindmount option if you think that's good enough to move forward. If we really want to pursue the rpm-ostree change that feels like it's not worth it to achieve 4-10s improvement, at this time. We're taking ~ 40 seconds to rebase from RHCOS to OCP Node even when 100% of RHCOS chunks are present and if we're going to invest that level of effort I'd hope to get that cut at least in half. Here's the breakdown based on 4.22 logs Number 1 is solved in #5905 just not in 4.22. The time savings may not actually come via rpm-ostree but perhaps by fetching content for number 2 ahead of this stage, there's already about 30 seconds between when the network is up and this stage. But I'm not sure what item 3 is and if we can address that. Here's number 2 in detail from the logs. I'm curious what's happening in number 3 and if it can be avoided in some way. |
I'm assuming here that
I think it is. I thought about the bind mount option before but you implemented it this way and I thought it was good enough. That's a long way of me saying I do like the bind mount option better if it works as well. |
Is that 30s number true even with #5905 ? |
Not for this effort, I agree. But it's generally desirable of course. The impact of this likely gets worse the more layers you have.
Good enough for me too to be clear! |
9afb1fc to
c33eca9
Compare
|
/hold |
Saves a few seconds, measured at around 4-10s. Assumes we tolerate the outcome that corruption due to powerfailure means we have to re-ignite a node from scratch. Suggested by @dustymabe and original analysis at https://github.com/dustymabe/20250529-ostree-fsync-analysis Rather than writing fsync=false to disk and re-enabling it on stop, we bind mount a tmpfs-backed copy of the ostree repo config over the real path. The ostree config command then writes to the bind mount instead of the underlying file, so the on-disk config is never modified. If the host crashes or reboots before ExecStopPost runs, the kernel tears down the bind mount and the original config is automatically restored.
c33eca9 to
a50480e
Compare
…ring firstboot Two related optimizations to reduce I/O contention during node scale-up: 1. Defer rpm-ostree rollback cleanup by 60s +jitter (removeRollback) On boot 2, MCD calls `rpm-ostree cleanup -r` shortly after starting, which triggers three syncfs() calls totalling ~43s. These run concurrently with CNI image pulls and compete for disk bandwidth during the kubelet-to-ready window. Sleeping before cleanup lets the node reach Ready first. Jitter (up to 30s) spreads cleanup across nodes that scale up in the same burst. 2. Disable ostree fsync during machine-config-daemon-firstboot (from openshift#6122) Bind mount a tmpfs-backed copy of the ostree repo config over the real path before the rebase so that setting core.fsync=false is ephemeral. If the host crashes or reboots, the bind mount disappears and the on-disk config is unchanged automatically. We append a second [core] section (GKeyFile merges duplicate groups) rather than using `ostree config set` because ostree uses an atomic rename which fails with EBUSY on a bind-mounted file. Saves ~3-5s on rebase fetch. Measured at n=18 on AWS m6a.xlarge.
|
@sdodson: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Saves a few seconds on ostree rebase during firstboot, measured at ~3-5s on AWS m6a.xlarge (n=18). Assumes we tolerate the outcome that corruption due to power failure means we have to re-ignite a node from scratch. Suggested by @dustymabe and original analysis at https://github.com/dustymabe/20250529-ostree-fsync-analysis
- What I did
Bind mount a tmpfs-backed copy of the ostree repo config over
/sysroot/ostree/repo/configbefore the rebase. Settingcore.fsync=falseis then ephemeral: if the host crashes or reboots beforeExecStopPostruns, the kernel tears down the bind mount and the original on-disk config is automatically restored.We append a second
[core]section to the tmpfs copy (GKeyFile merges duplicate groups, last value wins) rather than usingostree config set. The reason: ostree writes config changes via an atomic rename, which fails withEBUSYwhen the target file is itself a bind-mount point. Discovered during testing — journald showsrenameat(./tmp.zXuFKh, config): Device or resource busyifostree config setis attempted after the bind mount is in place.- How to verify it
Prior to this change, the node journal will contain many
Starting/Completed syncfs() for system repoentries fromrpm-ostreeduring the rebase. After this change, those per-object syncfs calls are absent during the rebase window (the bind mount is confirmed active bysysroot-ostree-repo-config.mount: Deactivated successfullyat service stop).- Description for the changelog
Disabled ostree fsync during bootstrapping via ephemeral bind mount to save ~3-5s on rebase