WIP: pkg/agent: wait for all volumes to be detached before rebooting by invidian · Pull Request #169 · flatcar/flatcar-linux-update-operator

invidian · 2022-06-13T17:11:52Z

This commit provides a PoC implementation of the agent waiting for all
volumes attached to the node to be detached, as a step after draining
the node. Shutting down a Pod does not mean its volume has been
detached: usually the CSI agent runs as a DaemonSet on the node and
takes care of detaching the volume from the node after the Pod shuts
down.

This commit improves the rebooting experience. Right now, if there is not
enough time for the CSI agent to detach the volumes from the node, the node
gets rebooted and pods using the still-attached volumes have no way to get
them attached on other nodes, which effectively increases the downtime for
stateful workloads.

This commit still requires tests and a better interface for users.

If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:

quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e

Closes #30

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

invidian · 2022-06-13T17:12:42Z

We should also consider compatibility with k8s versions before merging.

invidian · 2022-06-26T15:11:22Z

Just hit an issue with this code:

Draining failed because one workload couldn't satisfy the PDB.
Waiting for volume detachment never finished.

Perhaps we should also have some timeout while waiting for volumes to be detached.


invidian · 2023-01-11T18:46:54Z

I also found a bug with RBAC which is now fixed.

invidian · 2023-01-17T13:46:30Z

pkg/agent/agent.go


 	klog.Info("Node drained, rebooting")

+	for {


We should make this optional behind an opt-in flag.
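
One possible shape for such a flag, sketched with the standard `flag` package. The flag name `wait-for-volume-detach` and the `options` struct are hypothetical and are not taken from the agent's actual command line.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds agent settings; the field and flag names are hypothetical.
type options struct {
	waitForVolumeDetach bool
}

// parseOptions parses args into options. Waiting is disabled by default,
// so existing deployments keep their current behavior.
func parseOptions(args []string) (options, error) {
	var o options

	fs := flag.NewFlagSet("update-agent", flag.ContinueOnError)
	fs.BoolVar(&o.waitForVolumeDetach, "wait-for-volume-detach", false,
		"wait for all volumes to be detached before rebooting")

	if err := fs.Parse(args); err != nil {
		return options{}, err
	}

	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}

	if o.waitForVolumeDetach {
		fmt.Println("waiting for volumes to be detached before rebooting")
	}

	fmt.Println("rebooting")
}
```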

invidian · 2023-01-17T13:46:51Z

pkg/agent/agent.go

@@ -290,6 +290,30 @@ func (k *klocksmith) process(ctx context.Context) error {

 	klog.Info("Node drained, rebooting")


This message needs to be adjusted (or perhaps moved below the volume detachment).

invidian · 2023-01-17T13:51:46Z

pkg/agent/agent.go

+			klog.Errorf("Listing volume attachments: %v", err)
+			continue


We should probably add some mechanism here to give up, for example a timeout.

invidian · 2023-01-17T13:58:02Z

pkg/agent/agent.go

+			break
+		}
+
+		time.Sleep(5 * time.Second)


Perhaps this should also be adjustable.

invidian mentioned this pull request Jun 13, 2022

Ensure update-agent waits for all volumes to be detached before rebooting #30

Open

invidian force-pushed the invidian/wait-for-volumes-detach branch from 2affae3 to 88957e7 · January 11, 2023 17:47

invidian wants to merge 1 commit into master from invidian/wait-for-volumes-detach
