machine-config-daemon-firstboot: disable ostree fsync during bootstrap by sdodson · Pull Request #6122 · openshift/machine-config-operator

sdodson · 2026-06-02T18:10:35Z

Saves a few seconds on ostree rebase during firstboot, measured at ~3-5s on AWS m6a.xlarge (n=18). Assumes we tolerate the outcome that corruption due to power failure means we have to re-ignite a node from scratch. Suggested by @dustymabe and original analysis at https://github.com/dustymabe/20250529-ostree-fsync-analysis

- What I did

Bind mount a tmpfs-backed copy of the ostree repo config over /sysroot/ostree/repo/config before the rebase. Setting core.fsync=false is then ephemeral: if the host crashes or reboots before ExecStopPost runs, the kernel tears down the bind mount and the original on-disk config is automatically restored.

We append a second [core] section to the tmpfs copy (GKeyFile merges duplicate groups, last value wins) rather than using ostree config set. The reason: ostree writes config changes via an atomic rename, which fails with EBUSY when the target file is itself a bind-mount point. Discovered during testing — journald shows renameat(./tmp.zXuFKh, config): Device or resource busy if ostree config set is attempted after the bind mount is in place.

- How to verify it

Prior to this change, the node journal will contain many Starting/Completed syncfs() for system repo entries from rpm-ostree during the rebase. After this change, those per-object syncfs calls are absent during the rebase window (the bind mount is confirmed active by sysroot-ostree-repo-config.mount: Deactivated successfully at service stop).

- Description for the changelog

Disabled ostree fsync during bootstrapping via ephemeral bind mount to save ~3-5s on rebase

openshift-merge-bot · 2026-06-02T18:10:38Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-06-02T18:10:50Z

Walkthrough

Adds transient ExecStartPre steps to copy and modify the ostree repo config into a tmpfs, bind-mount it over /sysroot/ostree/repo/config to set core.fsync=false for the unit lifetime, runs existing podman firstboot commands, and unmounts the bind in ExecStopPost.

Changes

OStree fsync toggle during firstboot

Layer / File(s)	Summary
Machine-config-daemon firstboot fsync toggle `templates/common/_base/units/machine-config-daemon-firstboot.service.yaml`	Adds `ExecStartPre` steps that create a tmpfs-backed copy of the ostree repo config, append `[core] fsync = false`, and bind-mount it over `/sysroot/ostree/repo/config` before running the existing podman `firstboot-complete-machineconfig` ExecStart commands; adds `ExecStopPost` to unmount the bind mount afterward.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 15

✅ Passed checks (15 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'machine-config-daemon-firstboot: disable ostree fsync during bootstrap' directly describes the main change: disabling ostree fsync during the firstboot phase by bind-mounting a tmpfs-backed config copy.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	This PR only modifies a YAML template file (machine-config-daemon-firstboot.service) and does not introduce or modify any Ginkgo test files. The check is not applicable to non-test files.
Test Structure And Quality	✅ Passed	This PR modifies only a systemd service YAML template file; no Ginkgo test code or test files are added or modified, making the check not applicable.
Microshift Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. The only change is to a systemd service unit YAML file, which is not subject to the MicroShift test compatibility check.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR only modifies a YAML service unit file. No new Ginkgo e2e tests are added, so the SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR modifies a systemd unit file template, not a Kubernetes deployment or operator code. No Kubernetes scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	PR modifies a systemd unit YAML file, not OTE binary code. The check is inapplicable as it concerns OTE binaries communicating with openshift-tests.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR only modifies a YAML systemd service template file. No new Ginkgo e2e tests are added, so this check does not apply.
No-Weak-Crypto	✅ Passed	PR adds a systemd service unit file that disables ostree fsync during bootstrap via bind-mounting. No weak crypto algorithms, custom implementations, or non-constant-time secret comparisons detected.
Container-Privileges	✅ Passed	PR introduces privileged (--privileged, --net=host, --pid=host) podman containers in machine-config-daemon-firstboot.service for legitimate system bootstrap with clear justification.
No-Sensitive-Data-In-Logs	✅ Passed	The PR adds one systemd service file with only standard system configuration commands. No sensitive data, credentials, or PII are exposed in logging.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sdodson · 2026-06-02T18:11:49Z

  # Run this via podman because we want to use the nmstatectl binary in our container
  ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs  --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics
  ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs  --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig
+  ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true


Maybe unset is better?

This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.

One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.

openshift-ci · 2026-06-02T18:13:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdodson
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@templates/common/_base/units/machine-config-daemon-firstboot.service.yaml`:
- Around line 19-23: The ExecStartPre/ExecStopPost pair in
machine-config-daemon-firstboot.service.yaml currently disables core.fsync and
only re-enables it on clean stop, so add an idempotent recovery that always
ensures ostree core.fsync is true on subsequent boots: implement a small
follow-up systemd unit (or add to the existing machine-config-daemon.service)
that runs ostree config --repo=/sysroot/ostree/repo set core.fsync true
unconditionally at boot (or checks and sets it if false), enable it with
appropriate After= and WantedBy= so it always runs on startup, and keep the
original ExecStartPre/ExecStopPost as-is for the performance window.
- Line 19: The ExecStartPre that disables ostree core.fsync is currently not
marked as best-effort and will fail the unit if it errors; update the
machine-config-daemon-firstboot.service template so the ExecStartPre that runs
"/usr/bin/ostree ... set core.fsync false" is prefixed with "-" to make it
best-effort (so failures don't block firstboot), and also add a preceding
best-effort ExecStartPre that attempts to reset core.fsync to true
("/usr/bin/ostree ... set core.fsync true") to guard against previous runs
leaving fsync disabled; reference the ExecStartPre lines manipulating
"core.fsync" in the service unit and ensure both are best-effort so they won't
cause unit failure.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1649fad-a0af-48f6-b2ce-d0cf41d462d6

📥 Commits

Reviewing files that changed from the base of the PR and between bf66c0b and 9afb1fc.

📒 Files selected for processing (1)

templates/common/_base/units/machine-config-daemon-firstboot.service.yaml

jlebon

Ideally, the way we'd add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync which skips it only when committing layers.

jlebon · 2026-06-02T19:35:47Z

  # Run this via podman because we want to use the nmstatectl binary in our container
  ExecStart=/usr/bin/podman run --rm --privileged --net=host -v /:/rootfs  --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig --persist-nics
  ExecStart=/usr/bin/podman run --rm --privileged --pid=host --net=host -v /:/rootfs  --entrypoint machine-config-daemon '{{ .Images.machineConfigOperator }}' firstboot-complete-machineconfig
+  ExecStopPost=/usr/bin/ostree config --repo=/sysroot/ostree/repo set core.fsync true


This also loses whatever setting might've been set directly.
Also agree with the coderabbit comment re. error handling.

One way to structure this is to copy the config to somewhere in /run and then bind-mount on top of the config and then ostree config set it. That way it's really obvious that it's scoped to firstboot.

sdodson · 2026-06-03T14:40:33Z

Ideally, the way we'd add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync which skips it only when committing layers.

I can pursue the bindmount option if you think that's good enough to move forward.

If we really want to pursue the rpm-ostree change that feels like it's not worth it to achieve 4-10s improvement, at this time. We're taking ~ 40 seconds to rebase from RHCOS to OCP Node even when 100% of RHCOS chunks are present and if we're going to invest that level of effort I'd hope to get that cut at least in half. Here's the breakdown based on 4.22 logs

  Summary: The "51 chunks already present" saves what would be ~200-400 MB of base RHCOS content in an older-version scenario. But even with a perfect version match, you still pay for:
  1. ~22s for the extensions sidecar (always pulled from registry, never on AMI)
  2. ~22s for 219.6 MB of custom OCP layers (kubelet/CRI-O/OVS — same deal)
  3. ~10s of silent SELinux work
  4. ~9s of ostree staging

Number 1 is solved in #5905 just not in 4.22.

The time savings may not actually come via rpm-ostree but perhaps by fetching content for number 2 ahead of this stage, there's already about 30 seconds between when the network is up and this stage. But I'm not sure what item 3 is and if we can address that. Here's number 2 in detail from the logs.

  02:53:25  rpm-ostree rebase invoked
  02:53:26  ostree chunk layers already present: 51  ← the "same version" payoff
  02:53:26  custom layers needed: 2 (219.6 MB)      ← but these are always missing
  02:53:48  [0/2] Fetching bdc94d91... (219.6 MB)...done
  02:53:48  [1/2] Fetching 692e2beb... (6.6 kB)...done

I'm curious what's happening in number 3 and if it can be avoided in some way.

dustymabe · 2026-06-03T16:07:57Z

add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync

I'm assuming here that --disable-fsync or --disable-layers-fsync don't exist today and are things we'd have to add to rpm-ostree. I agree with @sdodson it's not worh the effort to implement that for 4-10s of gain.

I can pursue the bindmount option if you think that's good enough to move forward.

I think it is. I thought about the bind mount option before but you implemented it this way and I thought it was good enough. That's a long way of me saying I do like the bind mount option better if it works as well.

dustymabe · 2026-06-03T16:11:53Z

The time savings may not actually come via rpm-ostree but perhaps by fetching content for number 2 ahead of this stage, there's already about 30 seconds between when the network is up and this stage.

Is that 30s number true even with #5905 ?

jlebon · 2026-06-04T14:18:07Z

add a e.g. --disable-fsync to rpm-ostree rebase (or ostree container image pull) when we call it. But the even better fix here is to probably something like --disable-layers-fsync

I'm assuming here that --disable-fsync or --disable-layers-fsync don't exist today and are things we'd have to add to rpm-ostree. I agree with @sdodson it's not worh the effort to implement that for 4-10s of gain.

Not for this effort, I agree. But it's generally desirable of course. The impact of this likely gets worse the more layers you have.

I can pursue the bindmount option if you think that's good enough to move forward.

I think it is. I thought about the bind mount option before but you implemented it this way and I thought it was good enough. That's a long way of me saying I do like the bind mount option better if it works as well.

Good enough for me too to be clear!

sdodson · 2026-06-04T19:50:16Z

/hold
Debugging, it's failing to scale new nodes.

@dustymabe

Saves a few seconds, measured at around 4-10s. Assumes we tolerate the outcome that corruption due to powerfailure means we have to re-ignite a node from scratch. Suggested by @dustymabe and original analysis at https://github.com/dustymabe/20250529-ostree-fsync-analysis Rather than writing fsync=false to disk and re-enabling it on stop, we bind mount a tmpfs-backed copy of the ostree repo config over the real path. The ostree config command then writes to the bind mount instead of the underlying file, so the on-disk config is never modified. If the host crashes or reboots before ExecStopPost runs, the kernel tears down the bind mount and the original config is automatically restored.

…ring firstboot Two related optimizations to reduce I/O contention during node scale-up: 1. Defer rpm-ostree rollback cleanup by 60s +jitter (removeRollback) On boot 2, MCD calls `rpm-ostree cleanup -r` shortly after starting, which triggers three syncfs() calls totalling ~43s. These run concurrently with CNI image pulls and compete for disk bandwidth during the kubelet-to-ready window. Sleeping before cleanup lets the node reach Ready first. Jitter (up to 30s) spreads cleanup across nodes that scale up in the same burst. 2. Disable ostree fsync during machine-config-daemon-firstboot (from openshift#6122) Bind mount a tmpfs-backed copy of the ostree repo config over the real path before the rebase so that setting core.fsync=false is ephemeral. If the host crashes or reboots, the bind mount disappears and the on-disk config is unchanged automatically. We append a second [core] section (GKeyFile merges duplicate groups) rather than using `ostree config set` because ostree uses an atomic rename which fails with EBUSY on a bind-mounted file. Saves ~3-5s on rebase fetch. Measured at n=18 on AWS m6a.xlarge.

openshift-ci · 2026-06-05T18:03:52Z

@sdodson: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

sdodson commented Jun 2, 2026

View reviewed changes

openshift-ci Bot requested review from isabella-janssen and yuqi-zhang June 2, 2026 18:13

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread templates/common/_base/units/machine-config-daemon-firstboot.service.yaml Outdated

Comment thread templates/common/_base/units/machine-config-daemon-firstboot.service.yaml Outdated

jlebon reviewed Jun 2, 2026

View reviewed changes

sdodson force-pushed the mcd-firstboot-fsync branch from 9afb1fc to c33eca9 Compare June 4, 2026 14:52

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2026

sdodson force-pushed the mcd-firstboot-fsync branch from c33eca9 to a50480e Compare June 5, 2026 17:42

sdodson mentioned this pull request Jun 5, 2026

daemon: skip nice/ionice for extensions image pull during firstboot #6139

Draft

Conversation

sdodson commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-merge-bot Bot commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

sdodson Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

jlebon Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 2, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jlebon left a comment

Choose a reason for hiding this comment

Uh oh!

jlebon Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

sdodson commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dustymabe commented Jun 3, 2026

Uh oh!

dustymabe commented Jun 3, 2026

Uh oh!

jlebon commented Jun 4, 2026

Uh oh!

sdodson commented Jun 4, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sdodson commented Jun 2, 2026 •

edited

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

sdodson commented Jun 3, 2026 •

edited

Loading