Fail fast on cloud node removal#372
Conversation
The agent is never named `ghost`, rather `slave0`
| sessions.then(j -> { | ||
| // Start up a build and then reboot and take the node offline | ||
| assertEquals(0, j.jenkins.getLabel("ghost").getNodes().size()); // Make sure test impl is correctly deleted | ||
| assertNull(j.jenkins.getNode("ghost")); // Make sure test impl is correctly deleted |
There was a problem hiding this comment.
Useless assertion, since the node would just have the label ghost but usually named slave0 based on generation.
| }); | ||
| } | ||
|
|
||
| @Test public void onceRetentionStrategyNodeDisappearance() throws Throwable { |
There was a problem hiding this comment.
Essentially a copy of normalNodeDisappearance checking some behavioural differences.
|
Looks like some race condition, the |
…this is causing some havoc.
|
Re-launching CI |
jglick
left a comment
There was a problem hiding this comment.
I am confused by the doubled-up cause of interruption.
| s.setRetentionStrategy(new OnceRetentionStrategy(0)); | ||
| var run = p.scheduleBuild2(0).waitForStart(); | ||
| j.waitForMessage("+ sleep infinity", run); | ||
| j.jenkins.removeNode(s); |
There was a problem hiding this comment.
IIUC this is what Reaper in kubernetes would do as soon as an agent pod is deleted.
jglick
left a comment
There was a problem hiding this comment.
Actually I do not understand why there is a new cause of interruption at all. The only change that should need to be made is for RemovedNodeListener to interrupt the build immediately rather than after a delay, right?
* Use only one cause * When build is cancelled immediately, use RemovedNodeCause * When build is cancelled after observing timeout, use RemovedNodeTimeoutCause * Introduce a marker interface to simplify matching in AgentErrorCondition
jglick
left a comment
There was a problem hiding this comment.
Simpler now, thanks. Some optional suggestions.
| if (isOneShotAgent(node)) { | ||
| LOGGER.fine(() -> "Cancelling owner run for one-shot agent " + node.getNodeName() + " immediately"); | ||
| cancelOwnerExecution(node, new RemovedNodeCause()); |
|
|
| return node instanceof AbstractCloudSlave || | ||
| (node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy); |
There was a problem hiding this comment.
Unfortunately this heuristic does not match EC2AbstractSlave extends Slave nor EC2RetentionStrategy extends RetentionStrategy. I guess we need to hard-code support for those nonstandard implementations. CC @car-roll
There was a problem hiding this comment.
Can we start introducing a marker interface for this usage?
There was a problem hiding this comment.
Maybe, though I think it would suffice for EC2AbstractSlave to extend AbstractCloudSlave and EC2Computer to extend AbstractCloudComputer, with some minor refactoring to delete then-redundant logic. CloudBees-internal reference
There was a problem hiding this comment.
This cancels the node block immediately instead of waiting for the node to come in certain conditions when we are sure the node can't come back: using a cloud node, using
OnceRetentionStrategy.Testing done