OCPBUGS-84258: fsync static pod cert and manifest writes for crash durability #2176
Conversation
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

Adds durability: writes use `fsutil.WriteFileFsync`; atomic directory exchange is performed by a new `swapFsync` that fsyncs parent directories after the rename-exchange; installer manifest writes now use fsync-backed writes. One test failure scenario for `SwapDirectories` was removed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Writer as Writer
    participant FS as fsutil
    participant Staging as StagingDir
    participant Swap as renameat2(RENAME_EXCHANGE)
    participant Parents as ParentDirs
    Writer->>FS: WriteFileFsync(staging/file)
    FS->>FS: fsync file\nclose file\nfsync staging dir
    Writer->>Swap: swap staging <-> target (rename_exchange)
    Swap->>Parents: SyncPath(parent(target))
    Swap->>Parents: SyncPath(parent(staging))
    Parents->>Writer: swap complete
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go (1)
550-597: Assert the required fsync operations exist, not just their relative order.

As written, this can still pass when a whole class of ops disappears. `firstFileSync`/`firstParentSync` default to `len(ops)`, so removing all file fsyncs or both parent fsyncs won't necessarily fail the test. That weakens the regression guard for the new durability contract.

Suggested tightening:
```diff
 lastWrite := -1
 firstFileSync := len(ops)
 lastFileSync := -1
+fileSyncCount := 0
 dirSyncIdx := -1
 swapIdx := -1
 firstParentSync := len(ops)
 lastParentSync := -1
+parentSyncCount := 0
 removeIdx := -1
 for i, o := range ops {
 	switch o.kind {
 	case opWriteFile:
 		lastWrite = i
 	case opSyncFile:
+		fileSyncCount++
 		if i < firstFileSync {
 			firstFileSync = i
 		}
 		lastFileSync = i
 	case opSyncDir:
 		dirSyncIdx = i
 	case opSwap:
 		swapIdx = i
 	case opSyncParent:
+		parentSyncCount++
 		if i < firstParentSync {
 			firstParentSync = i
 		}
 		lastParentSync = i
 	case opRemoveAll:
 		removeIdx = i
 	}
 }
+if fileSyncCount != len(files) {
+	t.Fatalf("expected %d file syncs, got %d", len(files), fileSyncCount)
+}
+if dirSyncIdx == -1 {
+	t.Fatal("expected a staging directory sync")
+}
+if swapIdx == -1 {
+	t.Fatal("expected a directory swap")
+}
+if parentSyncCount != 2 {
+	t.Fatalf("expected 2 parent syncs, got %d", parentSyncCount)
+}
+if removeIdx == -1 {
+	t.Fatal("expected staging directory removal")
+}
 if lastWrite >= firstFileSync {
 	t.Errorf("all writes must complete before any file sync: last write at %d, first file sync at %d", lastWrite, firstFileSync)
 }
```
Inline comment:

In `pkg/operator/staticpod/internal/atomicdir/sync.go`, around lines 40-46: `syncPath` currently ignores the error returned by `f.Close()`. Capture the `f.Sync()` error (`errSync`), then capture the `f.Close()` error (`errClose`), and return `errSync` if it is non-nil, otherwise `errClose`, so that `Close()` errors are not lost when `Sync()` succeeds.
📒 Files selected for processing (2)

- pkg/operator/staticpod/internal/atomicdir/sync.go
- pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go
🧹 Nitpick comments (1)
pkg/operator/staticpod/installerpod/cmd.go (1)
628-633: Wrap fsync errors with path/operation context.

These branches currently return raw errors, which makes on-node diagnosis hard when one of the four sync steps fails.
Proposed diff
```diff
-if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
-	return err
+if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
+	return fmt.Errorf("failed syncing resource manifest file %q: %w", path.Join(resourceDir, manifestFileName), err)
 }
-if err := syncPath(resourceDir); err != nil {
-	return err
+if err := syncPath(resourceDir); err != nil {
+	return fmt.Errorf("failed syncing resource manifest directory %q: %w", resourceDir, err)
 }
@@
-if err := syncPath(path.Join(o.PodManifestDir, manifestFileName)); err != nil {
-	return err
+if err := syncPath(path.Join(o.PodManifestDir, manifestFileName)); err != nil {
+	return fmt.Errorf("failed syncing pod manifest file %q: %w", path.Join(o.PodManifestDir, manifestFileName), err)
 }
-if err := syncPath(o.PodManifestDir); err != nil {
-	return err
+if err := syncPath(o.PodManifestDir); err != nil {
+	return fmt.Errorf("failed syncing pod manifest directory %q: %w", o.PodManifestDir, err)
 }
```

Also applies to: 645-650
📒 Files selected for processing (1)

- pkg/operator/staticpod/installerpod/cmd.go
@tchap hey, could you take a look at this PR?
PROOF results (via CKAO#2124)

A PROOF PR was created on cluster-kube-apiserver-operator to validate this fix end-to-end.

CI: 13/13 green. All CI checks passed, including unit tests, verify, images, and all 7 e2e jobs.

Payload jobs: 3/3 passed. SNO and SNO-upgrade payloads were chosen because the customer environment is SNO edge clusters with ungraceful shutdowns during upgrades. GCP-upgrade covers multi-node upgrade regression.

What this proves

Direct proof (unit tests): the unit-test JUnit shows all new atomicdir tests passed on Linux CI.

Integration proof (CKAO PROOF PR): the installer pod and cert-syncer both call …
Force-pushed: f8e0ee1 to f9ebb30
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
🧹 Nitpick comments (1)
pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go (1)
540-541: Ordering test normalization is too broad and can mask regressions.

`sortConsecutiveSameKindOps` currently sorts all consecutive same-kind runs. That can hide order changes for deterministic stages (for example `MkdirAll`, `SyncParent`). Please limit normalization to map-iteration-driven kinds (`WriteFile`, `SyncFile`) only.

Proposed tightening:
```diff
-// Sort consecutive same-kind ops by path to normalize non-deterministic map iteration order.
-sortConsecutiveSameKindOps(ops)
+// Normalize only map-iteration-driven operations.
+sortConsecutiveSameKindOps(ops, map[string]bool{
+	"WriteFile": true,
+	"SyncFile":  true,
+})

-func sortConsecutiveSameKindOps(ops []fsOp) {
+func sortConsecutiveSameKindOps(ops []fsOp, sortableKinds map[string]bool) {
 	i := 0
 	for i < len(ops) {
 		j := i + 1
 		for j < len(ops) && ops[j].Kind == ops[i].Kind {
 			j++
 		}
-		if j-i > 1 {
+		if j-i > 1 && sortableKinds[ops[i].Kind] {
 			sort.Slice(ops[i:j], func(a, b int) bool {
 				return ops[i+a].Path < ops[i+b].Path
 			})
 		}
 		i = j
 	}
 }
```

Also applies to: 569-585
📒 Files selected for processing (3)

- pkg/operator/staticpod/installerpod/cmd.go
- pkg/operator/staticpod/internal/atomicdir/sync.go
- pkg/operator/staticpod/internal/atomicdir/sync_linux_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

- pkg/operator/staticpod/installerpod/cmd.go
I took a look as requested.
```go
var realFS = fileSystem{
	MkdirAll:  os.MkdirAll,
	RemoveAll: os.RemoveAll,
	WriteFile: os.WriteFile,
```
Could we replace os.WriteFile with WriteFileSync (we would have to implement it)?
Would that solve the file sync issue?
It would only partially solve the issue as we'd still need syncPath for the directory and parent fsyncs.
While a WriteFileSync helper would work, I like the consistency of syncPath being called across the various stages: write all files → syncPath(all files) → syncPath(directory) → swap → syncPath(parents).
Do we have to also update the certsyncpod?
No, certsyncpod already uses …
```diff
@@ -625,6 +625,12 @@ func (o *InstallOptions) writePod(rawPodBytes []byte, manifestFileName, resource
 if err := os.WriteFile(path.Join(resourceDir, manifestFileName), []byte(finalPodBytes), 0600); err != nil {
```
WriteFileSync / WriteFileFsync would give us a single function that makes any file write crash-safe (callers can't forget the fsync). It seems like a useful property. WDYT?
```go
if err := syncPath(path.Join(resourceDir, manifestFileName)); err != nil {
	return fmt.Errorf("failed syncing %q: %w", path.Join(resourceDir, manifestFileName), err)
}
if err := syncPath(resourceDir); err != nil {
```
Given that we also want to fsync the directory, why not include this functionality in WriteFileSync / WriteFileFsync?
I asked claude to prepare the potential changes, it came up with https://github.com/sanchezl/library-go/compare/atomicdir-fsync...p0lyn0mial:library-go:atomicdir-fsync-rev?expand=1 WDYT? (I didn't include …
Force-pushed: f9ebb30 to 6f0ecc4
@sanchezl: This pull request references Jira Issue OCPBUGS-84258, which is valid. 3 validation(s) were run on this bug.
Force-pushed: a6b3ef0 to 64b88b2
atomicdir.Sync writes files to a staging directory, atomically swaps it with the target directory via renameat2(RENAME_EXCHANGE), then deletes the old data. Without fsync, file data lives only in the kernel page cache. On ungraceful shutdown the journal replays the swap and deletion (metadata), but the file data was never flushed, leaving truncated or empty files. Introduce an fsutil package with WriteFileFsync (write + fsync file + fsync parent directory) and Fsync (fsync a path) primitives. Use WriteFileFsync for all file writes so each file is individually durable, and fsync both parent directories after the swap to persist which inode each directory name points to.
writePod uses bare os.WriteFile plus a delete-then-write pattern for kubelet manifests. On ungraceful shutdown, the delete is journaled but the new file data may not have reached disk, leaving the manifest missing. Replace os.WriteFile with fsutil.WriteFileFsync, which writes, fsyncs the file, and fsyncs the parent directory in a single call, ensuring both the resource copy and the kubelet manifest are durable before the function returns.
Force-pushed: 64b88b2 to f2c5c7f
```diff
@@ -84,5 +85,16 @@ func sync(fs *fileSystem, targetDir string, targetDirPerm os.FileMode, stagingDi
 if err := fs.SwapDirectories(targetDir, stagingDir); err != nil {
```
Actually, given that WriteFileFsync is similar to os.MkdirAll or os.RemoveAll, maybe SwapDirectories could call Fsync internally?
I almost did that but fsutil.WriteFileFsync calls fsutil.Fsync directly, not fs.Fsync from the struct, so it didn't make sense.
/lgtm
/hold for testing. Please update openshift/cluster-kube-apiserver-operator#2124 and make sure the CI is happy before we merge this PR.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, sanchezl, tchap. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
/hold cancel
openshift/cluster-kube-apiserver-operator#2124 is in good shape
@sanchezl: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@sanchezl: Jira Issue OCPBUGS-84258: some pull requests linked via external trackers have merged, but one linked pull request has not. All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-84258 has not been moved to the MODIFIED state.
/jira refresh
@sanchezl: Jira Issue OCPBUGS-84258: all pull requests linked via external trackers have merged. Jira Issue OCPBUGS-84258 has been moved to the MODIFIED state.
/jira backport release-4.22,release-4.21,release-4.20,release-4.19,release-4.18
@sanchezl: The following backport issues have been created. Queuing cherrypicks to the requested branches to be created after this PR merges.
@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.18".

@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.19".

@openshift-ci-robot: #2176 failed to apply on top of branch "release-4.20".

@openshift-ci-robot: new pull request created: #2204

@openshift-ci-robot: new pull request created: #2205
Summary
On SNO clusters, ungraceful shutdown can cause kube-apiserver cert files to be truncated or lost, rendering the cluster inoperable. Two code paths write critical files without fsync:
- `atomicdir.Sync` writes files to a staging directory, atomically swaps it with the target via `renameat2(RENAME_EXCHANGE)`, then deletes the old data. Without fsync, file data lives only in the kernel page cache. On ungraceful shutdown the journal replays the swap and deletion (metadata), but the file data was never flushed, leaving truncated or empty files.
- `installerpod.writePod` uses bare `os.WriteFile` plus a delete-then-write pattern for kubelet manifests. On ungraceful shutdown, the delete is journaled but the new file data may not have reached disk, leaving the manifest missing.
New
fsutilpackageIntroduces
pkg/operator/staticpod/internal/fsutilwith two durable I/O primitives:Fsync— fsyncs a file or directory, checking both sync and close errors.WriteFileFsync— writes a file, fsyncs the file, and fsyncs the parent directory to ensure both the data and the directory entry are durable on disk.atomicdir: fsync files and directories for crash durability
fileSystemstruct fieldWriteFileis renamed toWriteFileFsyncand now usesfsutil.WriteFileFsyncinstead ofos.WriteFile, so each file write is individually durable (file data + parent directory entry).Fsyncfield on thefileSystemstruct (set tofsutil.Fsync) enables mocking fsync failures in tests.fs.Fsyncto persist therenameat2(RENAME_EXCHANGE)result.installerpod: fsync pod manifest writes for crash durability
Replaces `os.WriteFile` with `fsutil.WriteFileFsync` for both the resource directory and kubelet manifest writes, ensuring file data and directory entries are durable before the function returns.

Test plan
- `TestSync` cases continue to pass (fsync is transparent for the happy path).
- `fsutil` unit tests verify `WriteFileFsync` and `Fsync` (content, permissions, error paths).