Skip to content

daemon: skip nice/ionice for extensions image pull during firstboot#6139

Draft
sdodson wants to merge 2 commits into
openshift:mainfrom
sdodson:worktree-firstboot-foreground-podman
Draft

daemon: skip nice/ionice for extensions image pull during firstboot#6139
sdodson wants to merge 2 commits into
openshift:mainfrom
sdodson:worktree-firstboot-foreground-podman

Conversation

@sdodson
Copy link
Copy Markdown
Member

@sdodson sdodson commented Jun 3, 2026

Two related optimizations to reduce I/O contention during node scale-up, measured on AWS m6a.xlarge (4.22 nightly, n=18 for fsync, n=1 baseline for cleanup contention).

1. Defer rpm-ostree rollback cleanup by 60s + jitter

On boot 2, MCD calls rpm-ostree cleanup -r shortly after starting. This triggers three syncfs() calls from the rpmostree daemon totalling ~43s, which run concurrently with CNI image pulls and compete for disk bandwidth during the kubelet-to-ready window. The syncfs calls straddle NodeReady — the first two complete before it, and the third finishes ~11s after.

Sleeping 60s + up to 30s jitter in removeRollback() before calling rpm-ostree cleanup -r defers the I/O until after the node is Ready. The sleep is inside checkStateOnFirstRun() which runs before dn.booting = false and before MachineConfigNodeResumed is posted, so MCO's Done state is delayed by the sleep — but NodeReady itself is kubelet-driven and is unaffected.

Jitter spreads cleanup across nodes that scale up in the same burst so they don't all hit the disk simultaneously when the delay expires.

2. Disable ostree fsync during machine-config-daemon-firstboot (includes #6122)

Bind mount a tmpfs-backed copy of the ostree repo config over /sysroot/ostree/repo/config before the rebase so that setting core.fsync=false is ephemeral. If the host crashes or reboots, the bind mount disappears and the on-disk config is unchanged automatically.

We append a second [core] section to the tmpfs copy (GKeyFile merges duplicate groups) rather than using ostree config set because ostree uses an atomic rename which fails with EBUSY on a bind-mounted file. Saves ~3-5s on rebase fetch.

- How to verify

For cleanup deferral: on a new node's boot 2 journal, the rpm-ostree[N]: Starting syncfs for system repo entries from machine-config-operator client should now appear >60s after kubelet registers, rather than concurrent with CNI pulls.

For fsync: the many per-object Starting/Completed syncfs() for system repo entries from rpm-ostree during the rebase window should be absent; sysroot-ostree-repo-config.mount: Deactivated successfully confirms the bind mount was active.

- Description for the changelog

Deferred rpm-ostree rollback cleanup to after NodeReady to reduce boot I/O contention; disabled ostree fsync during firstboot rebase to save ~3-5s

Summary by CodeRabbit

  • Bug Fixes

    • Improved control of extension image extraction during OS updates to avoid background extraction on first boot.
    • Reduced boot I/O contention by deferring cleanup work with a randomized delay.
    • Made first-boot service behavior ephemeral by bind-mounting a tmpfs-backed config to disable fsync only during first boot.
  • Tests

    • Updated tests to reflect the revised extension image extraction behavior.

During firstboot no workloads are running on the node yet, so
throttling podman to idle I/O and low CPU priority only slows down
provisioning. Thread a firstBoot flag from update() down through
applyOSChanges/applyLayeredOSChanges/ExtractExtensionsImage to
pullExtensionsImage and podmanCopy, switching from RunExtBackground
to RunExt when firstBoot is true.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Walkthrough

Threads a background flag through extensions image pull/copy so podman operations can run in background or foreground. Threads a firstBoot boolean through OS update paths to disable background extraction on first boot. Adds jittered rpm-ostree cleanup delay, updates a unit test, and bind-mounts a tmpfs OSTree config during firstboot.

Changes

Daemon update and tests

Layer / File(s) Summary
Image extraction functions with background flag
pkg/daemon/update.go
pullExtensionsImage and podmanCopy accept a background bool and choose pivotutils.RunExtBackground vs pivotutils.RunExt. CoreOSDaemon.ExtractExtensionsImage signature updated to accept and forward background.
OS update orchestration with first-boot threading
pkg/daemon/update.go
applyOSChanges and applyLayeredOSChanges accept firstBoot and propagate it; layered extraction calls ExtractExtensionsImage(..., !firstBoot) to avoid background extraction on first boot. Daemon.update and rollback paths now pass firstBoot; HyperShift path uses firstBoot=false.
Jittered rollback cleanup
pkg/daemon/update.go
removeRollback sleeps a randomized jitter before invoking runRpmOstree("cleanup", "-r"); adds math/rand import.
Test updated for podmanCopy signature
pkg/daemon/update_test.go
TestPodmanCopy_CreateContainerCall now passes the new background boolean when calling podmanCopy.

Firstboot service unit

Layer / File(s) Summary
Firstboot tmpfs bind-mounted OSTree config
templates/common/_base/units/machine-config-daemon-firstboot.service.yaml
Adds ExecStartPre/ExecStopPost steps that copy the OSTree repo config into /run, append a [core] section with fsync=false, bind-mount it over /sysroot/ostree/repo/config during firstboot, and unmount on stop so the change is not persisted.

🎯 3 (Moderate) | ⏱️ ~18 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error ExtractExtensionsImage error includes imgURL which could contain auth credentials (docker://user:pass@registry/image), exposing secrets in logs. Sanitize container image URLs before including in error messages to prevent exposure of embedded credentials in logs.
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes a core change: skipping nice/ionice (background execution) for extensions image pull during firstboot, which aligns with the main optimization objective to avoid I/O throttling during provisioning.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No Ginkgo tests present in PR. The modified test file (update_test.go) uses standard Go testing (func TestXXX), not Ginkgo. All test names are static and descriptive.
Test Structure And Quality ✅ Passed The PR modifies pkg/daemon/update_test.go which uses standard Go testing with testify, not Ginkgo. The custom check applies only to Ginkgo tests, so it is not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Only daemon code, unit tests, and systemd templates were modified.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added. Changes are to daemon code, a Go unit test, and a systemd service file.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies internal daemon Go functions and a Systemd unit file—not Kubernetes manifests or controllers. No scheduling constraints (affinity, node selectors, topology spread, PDB) are introduced.
Ote Binary Stdout Contract ✅ Passed The new klog.Infof in removeRollback() runs from syncNode during daemon operation, not initialization. machine-config-daemon binary already sets logtostderr=true in start.go.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests. Modified files are: daemon implementation (update.go), standard Go unit test (update_test.go), and systemd service file. Check is not applicable.
No-Weak-Crypto ✅ Passed PR introduces no weak crypto algorithms, custom crypto, or non-constant-time secret comparisons. The math/rand usage is for scheduling jitter, not cryptographic purposes.
Container-Privileges ✅ Passed The PR modifies Go source code and adds a systemd service template (not a K8s manifest). No K8s containers/manifests with privileged security issues are introduced.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdodson
Once this PR has been reviewed and has the lgtm label, please assign pablintino for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ring firstboot

Two related optimizations to reduce I/O contention during node scale-up:

1. Defer rpm-ostree rollback cleanup by 60s +jitter (removeRollback)

   On boot 2, MCD calls `rpm-ostree cleanup -r` shortly after starting,
   which triggers three syncfs() calls totalling ~43s. These run
   concurrently with CNI image pulls and compete for disk bandwidth during
   the kubelet-to-ready window. Sleeping before cleanup lets the node reach
   Ready first. Jitter (up to 30s) spreads cleanup across nodes that scale
   up in the same burst.

2. Disable ostree fsync during machine-config-daemon-firstboot (from openshift#6122)

   Bind mount a tmpfs-backed copy of the ostree repo config over the real
   path before the rebase so that setting core.fsync=false is ephemeral.
   If the host crashes or reboots, the bind mount disappears and the
   on-disk config is unchanged automatically.

   We append a second [core] section (GKeyFile merges duplicate groups)
   rather than using `ostree config set` because ostree uses an atomic
   rename which fails with EBUSY on a bind-mounted file.

   Saves ~3-5s on rebase fetch. Measured at n=18 on AWS m6a.xlarge.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/daemon/update.go (1)

1497-1504: 🏗️ Heavy lift

Make the rollback-cleanup deferral cancelable.

This adds an uninterruptible 60–90s sleep in a daemon path. If shutdown/update cancellation happens during that window, the goroutine still blocks until the timer expires. Thread a context.Context (or the daemon’s existing stop signal) into removeRollback() and wait on a timer with select instead.

♻️ Suggested direction
-func (dn *Daemon) removeRollback() error {
+func (dn *Daemon) removeRollback(ctx context.Context) error {
 	if !dn.os.IsCoreOSVariant() {
 		return nil
 	}

 	delay := 60*time.Second + time.Duration(rand.Int63n(int64(30*time.Second)))
 	klog.Infof("Deferring rpm-ostree rollback cleanup by %s to reduce boot I/O contention", delay.Round(time.Second))
-	time.Sleep(delay)
+
+	timer := time.NewTimer(delay)
+	defer timer.Stop()
+	select {
+	case <-timer.C:
+	case <-ctx.Done():
+		return ctx.Err()
+	}
+
 	return runRpmOstree("cleanup", "-r")
 }

As per coding guidelines, **/*.go: context.Context for cancellation and timeouts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/update.go` around lines 1497 - 1504, The current uninterruptible
sleep in removeRollback() (the delay := 60*time.Second + ...; time.Sleep(delay))
must be made cancelable: add a context.Context parameter (or use the daemon's
existing stop channel/context) to removeRollback and replace time.Sleep with a
timer and a select that waits on timer.C and ctx.Done() so the deferred
rpm-ostree cleanup can be aborted on shutdown/update cancellation; ensure the
log still reports the intended delay when scheduled and that the function
returns promptly on ctx cancellation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/daemon/update.go`:
- Around line 1497-1504: The current uninterruptible sleep in removeRollback()
(the delay := 60*time.Second + ...; time.Sleep(delay)) must be made cancelable:
add a context.Context parameter (or use the daemon's existing stop
channel/context) to removeRollback and replace time.Sleep with a timer and a
select that waits on timer.C and ctx.Done() so the deferred rpm-ostree cleanup
can be aborted on shutdown/update cancellation; ensure the log still reports the
intended delay when scheduled and that the function returns promptly on ctx
cancellation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f74aebfb-a6e1-4ea7-83c7-6cf6de1dc806

📥 Commits

Reviewing files that changed from the base of the PR and between 67fe742 and f2507cb.

📒 Files selected for processing (2)
  • pkg/daemon/update.go
  • templates/common/_base/units/machine-config-daemon-firstboot.service.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant