Run evacuation asynchronously, so it can evacuate 30+ VMs #13
dawiddeja wants to merge 1 commit into beekhof:master
Also fixes the 'simultaneous controller and compute failure' problem.
Since fence_compute is already a Python module, there is no reason to spawn a thread that runs the Nova CLI in a subprocess; we should use the stable Nova client API directly instead.
What about this instead?
Unfortunately I don't think this is safe due to a race condition in Nova following evacuate. Until this race is fixed, the logic we need from pacemaker is:
Due to a design deficiency in Nova, it is not possible to distinguish between an instance in the middle of a rebuild and an instance being evacuated. This means that, to reliably detect that instances have been evacuated, we need the additional context that an evacuation is in progress and that we are waiting for it to complete. I don't believe this context exists when enabling a resource, so I don't believe we can reliably do this check there. I believe the only place it exists is in the fence script itself, which must therefore block until all instances have been evacuated. If this hits a timeout, I think we have to extend the timeout. If it's possible to programmatically extend the timeout when we detect liveness, that could be more robust. This sucks, but we can improve it when Nova is fixed.
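To make the "block until evacuated, but extend on liveness" idea concrete, here is a rough sketch of the in-script half only. It assumes a hypothetical `count_remaining` callback that returns how many instances are still waiting to be evacuated; how that number is obtained from Nova, and how Pacemaker's own fencing timeout would be extended, are not shown and are not something this sketch claims to solve.

```python
import time


def wait_for_evacuation(count_remaining, timeout=60.0, poll_interval=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Block until count_remaining() reports 0 instances left.

    count_remaining is a hypothetical callback (an assumption of this
    sketch). The deadline is pushed back by `timeout` every time the
    count drops, so the wait only fails after `timeout` seconds
    without any observed progress.
    Returns True on completion, False on a stall.
    """
    remaining = count_remaining()
    deadline = clock() + timeout
    while remaining > 0:
        if clock() >= deadline:
            return False  # no progress seen within the timeout window
        sleep(poll_interval)
        current = count_remaining()
        if current < remaining:
            # Liveness detected: something was evacuated, so extend.
            deadline = clock() + timeout
        remaining = current
    return True
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting; in the fence script they would stay at their defaults.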
@beekhof Is it worth revisiting this, or shall we just close?
This brings back asynchronous evacuation of VMs: without it, the timeout is reached when there are many instances on the dead host.
It also moves the 'wait for Nova to update its internal state' step out of the fencing script, which should cover the simultaneous compute and controller failure problem. Even if we want to resolve that another way, the evacuation itself cannot run inside the fencing script, because it can take a very long time for a large number of VMs.
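For illustration, issuing the evacuation requests concurrently through the client API rather than the CLI might look roughly like the sketch below. This is not the code in this PR. It assumes `nova` is a python-novaclient `Client` object; `evacuate_host` and `max_workers` are names made up for this example, and the exact arguments accepted by `servers.evacuate` depend on the API microversion, so only the server is passed.

```python
from concurrent.futures import ThreadPoolExecutor


def evacuate_host(nova, host, max_workers=8):
    """Request evacuation of every instance on `host` concurrently.

    `nova` is assumed to be a python-novaclient Client (assumption).
    Returns (requested, failed): a list of server IDs whose request
    was accepted, and a list of (server_id, exception) pairs.
    """
    servers = nova.servers.list(search_opts={"host": host, "all_tenants": 1})
    requested, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One request per instance, submitted without waiting in between.
        futures = {pool.submit(nova.servers.evacuate, s): s for s in servers}
        for future, server in futures.items():
            try:
                future.result()
                requested.append(server.id)
            except Exception as exc:  # keep going if one request fails
                failed.append((server.id, exc))
    return requested, failed
```

Note that this only collects the outcome of each request. It says nothing about whether the instance has finished rebuilding on its new host, which is the part that has to be waited for separately.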