[release-4.20] OCPBUGS-85271: fsync static pod cert and manifest writes for crash durability #2207
pkg/operator/staticpod/internal/atomicdir contains internal helpers for performing atomic operations on directories. This patch contains only a standalone swap function, which will subsequently be used to synchronize a directory with a given state.
The function can be used to atomically sync a directory with the desired state. This uses atomicdir.swap implemented earlier.
Use atomicdir.Sync to write the target secret/configmap directories so that they stay synchronized with the relevant objects. Unit tests are added, but coverage is incomplete; in particular, failures of filesystem operations are not tested.
atomicdir.Sync writes files to a staging directory, atomically swaps it with the target directory via renameat2(RENAME_EXCHANGE), then deletes the old data. Without fsync, file data lives only in the kernel page cache. On ungraceful shutdown the journal replays the swap and deletion (metadata), but the file data was never flushed, leaving truncated or empty files. Introduce an fsutil package with WriteFileFsync (write + fsync file + fsync parent directory) and Fsync (fsync a path) primitives. Use WriteFileFsync for all file writes so each file is individually durable, and fsync both parent directories after the swap to persist which inode each directory name points to.
writePod uses bare os.WriteFile plus a delete-then-write pattern for kubelet manifests. On ungraceful shutdown, the delete is journaled but the new file data may not have reached disk, leaving the manifest missing. Replace os.WriteFile with fsutil.WriteFileFsync, which writes, fsyncs the file, and fsyncs the parent directory in a single call, ensuring both the resource copy and the kubelet manifest are durable before the function returns.
@sanchezl: This pull request references Jira Issue OCPBUGS-85271, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
/retest
@sanchezl: all tests passed!
/payload-job periodic-ci-openshift-release-main-nightly-4.20-e2e-aws-ovn-upgrade-fips
/payload-job periodic-ci-openshift-release-main-ci-4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-upgrade
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rh-roman, sanchezl.
This is a cherry-pick of #2176.