WIP: pkg/agent: wait for all volumes to be detached before rebooting#169
Draft
WIP: pkg/agent: wait for all volumes to be detached before rebooting#169
Conversation
Member
Author
|
We should also consider compatibility with k8s versions before merging. |
Member
Author
|
Just hit some issue with this code:
Perhaps we should also have some timeout while waiting for volumes to be detached. |
This commit provides PoC version of implementing agent waiting for all volumtes attached to the node to be detached as a step after draining the node, as shutting down the Pod does not mean the volume has been detached, as usually CSI agent will be running as a DaemonSet on the node and will take care of detaching the volume from the node when the pod shuts down. This commit improves rebooting experience, as right now if there is not enough time for CSI agent to detach the volumes from the node, node gets rebooted and pods using attached volumes have no way to be attached to other nodes, which effectively increases the downtime caused for stateful workloads. This commit still requires tests and better interface for the users. If someone wants to try this feature on their own cluster, I've published the following image I've been testing with: quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e Closes #30 Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
2affae3 to
88957e7
Compare
Member
Author
|
I also found a bug with RBAC which is now fixed. |
invidian
commented
Jan 17, 2023
|
|
||
| klog.Info("Node drained, rebooting") | ||
|
|
||
| for { |
Member
Author
There was a problem hiding this comment.
We should make this optional behind a opt-in flag.
| @@ -290,6 +290,30 @@ func (k *klocksmith) process(ctx context.Context) error { | |||
|
|
|||
| klog.Info("Node drained, rebooting") | |||
Member
Author
There was a problem hiding this comment.
This message needs to be adjusted (or perhaps moved below the volumes detachment).
Comment on lines
+296
to
+297
| klog.Errorf("Listing volume attachments: %v", err) | ||
| continue |
Member
Author
There was a problem hiding this comment.
We should probably add some mechanism here to give up, for example a timeout.
| break | ||
| } | ||
|
|
||
| time.Sleep(5 * time.Second) |
Member
Author
There was a problem hiding this comment.
Perhaps this should also be adjustable.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit provides PoC version of implementing agent waiting for all
volumtes attached to the node to be detached as a step after draining
the node, as shutting down the Pod does not mean the volume has been
detached, as usually CSI agent will be running as a DaemonSet on the
node and will take care of detaching the volume from the node when the
pod shuts down.
This commit improves rebooting experience, as right now if there is not
enough time for CSI agent to detach the volumes from the node, node gets
rebooted and pods using attached volumes have no way to be attached to
other nodes, which effectively increases the downtime caused for
stateful workloads.
This commit still requires tests and better interface for the users.
If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:
quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e
Closes #30
Signed-off-by: Mateusz Gozdek mgozdek@microsoft.com