OCPBUGS-84258: fsync static pod cert and manifest writes for crash durability #2176
Conversation
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

Adds durability: writes use `fsutil.WriteFileFsync`; atomic directory exchange is performed by a new `swapFsync` that fsyncs parent directories after the rename-exchange; installer manifest writes now use fsync-backed writes. One test failure scenario for `SwapDirectories` was removed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Writer as Writer
    participant FS as fsutil
    participant Staging as StagingDir
    participant Swap as renameat2(RENAME_EXCHANGE)
    participant Parents as ParentDirs
    Writer->>FS: WriteFileFsync(staging/file)
    FS->>FS: fsync file\nclose file\nfsync staging dir
    Writer->>Swap: swap staging <-> target (rename_exchange)
    Swap->>Parents: SyncPath(parent(target))
    Swap->>Parents: SyncPath(parent(staging))
    Parents->>Writer: swap complete
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go (1)
550-597: Assert the required fsync operations exist, not just their relative order.

As written, this can still pass when a whole class of ops disappears. `firstFileSync`/`firstParentSync` default to `len(ops)`, so removing all file fsyncs or both parent fsyncs won't necessarily fail the test. That weakens the regression guard for the new durability contract.

Suggested tightening:
```diff
 lastWrite := -1
 firstFileSync := len(ops)
 lastFileSync := -1
+fileSyncCount := 0
 dirSyncIdx := -1
 swapIdx := -1
 firstParentSync := len(ops)
 lastParentSync := -1
+parentSyncCount := 0
 removeIdx := -1
 for i, o := range ops {
 	switch o.kind {
 	case opWriteFile:
 		lastWrite = i
 	case opSyncFile:
+		fileSyncCount++
 		if i < firstFileSync {
 			firstFileSync = i
 		}
 		lastFileSync = i
 	case opSyncDir:
 		dirSyncIdx = i
 	case opSwap:
 		swapIdx = i
 	case opSyncParent:
+		parentSyncCount++
 		if i < firstParentSync {
 			firstParentSync = i
 		}
 		lastParentSync = i
 	case opRemoveAll:
 		removeIdx = i
 	}
 }
+if fileSyncCount != len(files) {
+	t.Fatalf("expected %d file syncs, got %d", len(files), fileSyncCount)
+}
+if dirSyncIdx == -1 {
+	t.Fatal("expected a staging directory sync")
+}
+if swapIdx == -1 {
+	t.Fatal("expected a directory swap")
+}
+if parentSyncCount != 2 {
+	t.Fatalf("expected 2 parent syncs, got %d", parentSyncCount)
+}
+if removeIdx == -1 {
+	t.Fatal("expected staging directory removal")
+}
 if lastWrite >= firstFileSync {
 	t.Errorf("all writes must complete before any file sync: last write at %d, first file sync at %d", lastWrite, firstFileSync)
 }
```
Inline comment:

In `pkg/operator/staticpod/internal/atomicdir/sync.go`, around lines 40-46: `syncPath` currently ignores the error returned by `f.Close()`. Capture the `f.Sync()` error (`errSync`), then capture the `f.Close()` error (`errClose`), and return `errSync` if it is non-nil, otherwise `errClose`, so that `Close()` errors are not lost when `Sync()` succeeds.
📒 Files selected for processing (2)

- pkg/operator/staticpod/internal/atomicdir/sync.go
- pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go
🧹 Nitpick comments (1)
pkg/operator/staticpod/installerpod/cmd.go (1)
628-633: Wrap fsync errors with path/operation context.

These branches currently return raw errors, which makes on-node diagnosis hard when one of the four sync steps fails.
Proposed diff
```diff
-if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
-	return err
+if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
+	return fmt.Errorf("failed syncing resource manifest file %q: %w", path.Join(resourceDir, manifestFileName), err)
 }
-if err := syncPath(resourceDir); err != nil {
-	return err
+if err := syncPath(resourceDir); err != nil {
+	return fmt.Errorf("failed syncing resource manifest directory %q: %w", resourceDir, err)
 }
@@
-if err := syncPath(path.Join(o.PodManifestDir, manifestFileName)); err != nil {
-	return err
+if err := syncPath(path.Join(o.PodManifestDir, manifestFileName)); err != nil {
+	return fmt.Errorf("failed syncing pod manifest file %q: %w", path.Join(o.PodManifestDir, manifestFileName), err)
 }
-if err := syncPath(o.PodManifestDir); err != nil {
-	return err
+if err := syncPath(o.PodManifestDir); err != nil {
+	return fmt.Errorf("failed syncing pod manifest directory %q: %w", o.PodManifestDir, err)
 }
```

Also applies to: 645-650
📒 Files selected for processing (1)

- pkg/operator/staticpod/installerpod/cmd.go
@tchap hey, could you take a look at this PR?
PROOF results (via CKAO#2124)

A PROOF PR was created on cluster-kube-apiserver-operator to validate this fix end-to-end.

CI: 13/13 green. All CI checks passed, including unit tests, verify, images, and all 7 e2e jobs.

Payload jobs: 3/3 passed. SNO and SNO-upgrade payloads were chosen because the customer environment is SNO edge clusters with ungraceful shutdowns during upgrades. GCP-upgrade covers multi-node upgrade regression.

What this proves

Direct proof (unit tests): the unit-test JUnit shows all new atomicdir tests passed on Linux CI.

Integration proof (CKAO PROOF PR): the installer pod and cert-syncer both call …
Force-pushed: f8e0ee1 to f9ebb30
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
🧹 Nitpick comments (1)
pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go (1)
540-541: Ordering test normalization is too broad and can mask regressions.

`sortConsecutiveSameKindOps` currently sorts all consecutive same-kind runs. That can hide order changes for deterministic stages (for example `MkdirAll`, `SyncParent`). Please limit normalization to map-iteration-driven kinds (`WriteFile`, `SyncFile`) only.

Proposed tightening:
```diff
-// Sort consecutive same-kind ops by path to normalize non-deterministic map iteration order.
-sortConsecutiveSameKindOps(ops)
+// Normalize only map-iteration-driven operations.
+sortConsecutiveSameKindOps(ops, map[string]bool{
+	"WriteFile": true,
+	"SyncFile":  true,
+})

-func sortConsecutiveSameKindOps(ops []fsOp) {
+func sortConsecutiveSameKindOps(ops []fsOp, sortableKinds map[string]bool) {
 	i := 0
 	for i < len(ops) {
 		j := i + 1
 		for j < len(ops) && ops[j].Kind == ops[i].Kind {
 			j++
 		}
-		if j-i > 1 {
+		if j-i > 1 && sortableKinds[ops[i].Kind] {
 			sort.Slice(ops[i:j], func(a, b int) bool {
 				return ops[i+a].Path < ops[i+b].Path
 			})
 		}
 		i = j
 	}
 }
```

Also applies to: 569-585
📒 Files selected for processing (3)

- pkg/operator/staticpod/installerpod/cmd.go
- pkg/operator/staticpod/internal/atomicdir/sync.go
- pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

- pkg/operator/staticpod/installerpod/cmd.go
I took a look as requested.
```go
var realFS = fileSystem{
	MkdirAll:  os.MkdirAll,
	RemoveAll: os.RemoveAll,
	WriteFile: os.WriteFile,
```
Could we replace os.WriteFile with WriteFileSync (we would have to implement it)?
Would that solve the file sync issue?
It would only partially solve the issue as we'd still need syncPath for the directory and parent fsyncs.
While a WriteFileSync helper would work, I like the consistency of syncPath being called across the various stages: write all files → syncPath(all files) → syncPath(directory) → swap → syncPath(parents).
Do we have to also update the certsyncpod?
No, certsyncpod already uses …
```diff
@@ -625,6 +625,12 @@ func (o *InstallOptions) writePod(rawPodBytes []byte, manifestFileName, resource
 if err := os.WriteFile(path.Join(resourceDir, manifestFileName), []byte(finalPodBytes), 0600); err != nil {
```
WriteFileSync / WriteFileFsync would give us a single function that makes any file write crash-safe (callers can't forget the fsync). It seems like a useful property. WDYT?
```go
if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
	return fmt.Errorf("failed syncing %q: %w", path.Join(resourceDir, manifestFileName), err)
}
if err := syncPath(resourceDir); err != nil {
```
Given that we also want to fsync the directory, why not include this functionality in WriteFileSync / WriteFileFsync?
I asked claude to prepare the potential changes, it came up with https://github.com/sanchezl/library-go/compare/atomicdir-fsync...p0lyn0mial:library-go:atomicdir-fsync-rev?expand=1 WDYT? (I didn't include …
Force-pushed: f9ebb30 to 6f0ecc4
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is valid. 3 validation(s) were run on this bug.
Force-pushed: a6b3ef0 to 64b88b2
atomicdir.Sync writes files to a staging directory, atomically swaps it with the target directory via renameat2(RENAME_EXCHANGE), then deletes the old data. Without fsync, file data lives only in the kernel page cache. On ungraceful shutdown the journal replays the swap and deletion (metadata), but the file data was never flushed, leaving truncated or empty files. Introduce an fsutil package with WriteFileFsync (write + fsync file + fsync parent directory) and Fsync (fsync a path) primitives. Use WriteFileFsync for all file writes so each file is individually durable, and fsync both parent directories after the swap to persist which inode each directory name points to.
writePod uses bare os.WriteFile plus a delete-then-write pattern for kubelet manifests. On ungraceful shutdown, the delete is journaled but the new file data may not have reached disk, leaving the manifest missing. Replace os.WriteFile with fsutil.WriteFileFsync, which writes, fsyncs the file, and fsyncs the parent directory in a single call, ensuring both the resource copy and the kubelet manifest are durable before the function returns.
Force-pushed: 64b88b2 to f2c5c7f
```diff
@@ -84,5 +85,16 @@ func sync(fs *fileSystem, targetDir string, targetDirPerm os.FileMode, stagingDi
 if err := fs.SwapDirectories(targetDir, stagingDir); err != nil {
```
Actually, given that WriteFileFsync is similar to os.MkdirAll or os.RemoveAll, maybe SwapDirectories could call Fsync internally?
I almost did that but fsutil.WriteFileFsync calls fsutil.Fsync directly, not fs.Fsync from the struct, so it didn't make sense.
/lgtm
/hold for testing. Please update openshift/cluster-kube-apiserver-operator#2124 and make sure the CI is happy before we merge this PR.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, sanchezl, tchap. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
/hold cancel
openshift/cluster-kube-apiserver-operator#2124 is in good shape
@sanchezl: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@sanchezl: Jira Issue OCPBUGS-84258: some pull requests linked via external trackers have merged, but one linked pull request has not. All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-84258 has not been moved to the MODIFIED state.
/jira refresh
@sanchezl: Jira Issue OCPBUGS-84258: all pull requests linked via external trackers have merged. Jira Issue OCPBUGS-84258 has been moved to the MODIFIED state.
/jira backport release-4.22,release-4.21,release-4.20,release-4.19,release-4.18
@sanchezl: The following backport issues have been created. Queuing cherrypicks to the requested branches to be created after this PR merges.
@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.18".

@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.19".

@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.20".

@openshift-ci-robot: new pull request created: #2204

@openshift-ci-robot: new pull request created: #2205
Summary
On SNO clusters, ungraceful shutdown can cause kube-apiserver cert files to be truncated or lost, rendering the cluster inoperable. Two code paths write critical files without fsync:
- `atomicdir.Sync` writes files to a staging directory, atomically swaps it with the target via `renameat2(RENAME_EXCHANGE)`, then deletes the old data. Without fsync, file data lives only in the kernel page cache. On ungraceful shutdown the journal replays the swap and deletion (metadata), but the file data was never flushed, leaving truncated or empty files.
- `installerpod.writePod` uses bare `os.WriteFile` plus a delete-then-write pattern for kubelet manifests. On ungraceful shutdown, the delete is journaled but the new file data may not have reached disk, leaving the manifest missing.
New
fsutilpackageIntroduces
pkg/operator/staticpod/internal/fsutilwith two durable I/O primitives:Fsync— fsyncs a file or directory, checking both sync and close errors.WriteFileFsync— writes a file, fsyncs the file, and fsyncs the parent directory to ensure both the data and the directory entry are durable on disk.atomicdir: fsync files and directories for crash durability
fileSystemstruct fieldWriteFileis renamed toWriteFileFsyncand now usesfsutil.WriteFileFsyncinstead ofos.WriteFile, so each file write is individually durable (file data + parent directory entry).Fsyncfield on thefileSystemstruct (set tofsutil.Fsync) enables mocking fsync failures in tests.fs.Fsyncto persist therenameat2(RENAME_EXCHANGE)result.installerpod: fsync pod manifest writes for crash durability
Replaces `os.WriteFile` with `fsutil.WriteFileFsync` for both the resource directory and kubelet manifest writes, ensuring file data and directory entries are durable before the function returns.

Test plan
- `TestSync` cases continue to pass (fsync is transparent for the happy path).
- `fsutil` unit tests verify `WriteFileFsync` and `Fsync` (content, permissions, error paths).