Fail fast on cloud node removal by Vlatombe · Pull Request #372 · jenkinsci/workflow-durable-task-step-plugin

Vlatombe · 2024-05-14T13:32:54Z

This cancels the node block immediately, instead of waiting for the node to come back, in conditions where we are sure the node can't come back: when using a cloud node, or when using OnceRetentionStrategy.

Fail fast after cloud node removal
Remove useless assertion

Testing done

### Submitter checklist
- [ ] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [ ] Ensure that the pull request title represents the desired changelog entry
- [ ] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [ ] Link to relevant pull requests, esp. upstream and downstream changes
- [ ] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

The agent is never named `ghost`, rather `slave0`

Vlatombe · 2024-05-14T13:38:27Z

        sessions.then(j -> {
            // Start up a build and then reboot and take the node offline
            assertEquals(0, j.jenkins.getLabel("ghost").getNodes().size()); // Make sure test impl is correctly deleted
-            assertNull(j.jenkins.getNode("ghost")); // Make sure test impl is correctly deleted


Useless assertion: the node only carries the label `ghost`; its name is generated and is usually `slave0`.

Vlatombe · 2024-05-14T13:41:10Z

        });
    }

+    @Test public void onceRetentionStrategyNodeDisappearance() throws Throwable {


Essentially a copy of normalNodeDisappearance that checks some behavioural differences.

Vlatombe · 2024-05-14T14:32:16Z

Looks like a race condition: the Run with InterruptedBuildAction gets saved to disk properly, but somehow the action is not visible in the Run object in the next session.

…this is causing some havoc.

Vlatombe · 2024-05-14T15:18:01Z

Re-launching CI

jglick

I am confused by the doubled-up cause of interruption.

jglick · 2024-05-14T16:30:52Z

+            s.setRetentionStrategy(new OnceRetentionStrategy(0));
+            var run = p.scheduleBuild2(0).waitForStart();
+            j.waitForMessage("+ sleep infinity", run);
+            j.jenkins.removeNode(s);


IIUC this is what Reaper in kubernetes would do as soon as an agent pod is deleted.

jglick

Actually I do not understand why there is a new cause of interruption at all. The only change that should need to be made is for RemovedNodeListener to interrupt the build immediately rather than after a delay, right?

* Use only one cause
* When build is cancelled immediately, use RemovedNodeCause
* When build is cancelled after observing timeout, use RemovedNodeTimeoutCause
* Introduce a marker interface to simplify matching in AgentErrorCondition
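
For readers following along, here is a minimal, self-contained sketch of the design that commit message describes: two distinct causes sharing a marker interface, so the retry condition can match on one type. It does not use the real Jenkins classes. `CauseOfInterruption` is a local stand-in, the marker name `AgentRemovalCause`, the `UserInterruption` class, the `AgentErrorConditionSketch` helper and the description strings are all hypothetical, and later commits in this PR adjusted the design further.

```java
import java.util.List;

/** Local stand-in for Jenkins' CauseOfInterruption; NOT the real class. */
abstract class CauseOfInterruption {
    abstract String getShortDescription();
}

/** Hypothetical marker interface: tags causes meaning "the agent is gone". */
interface AgentRemovalCause {}

/** Used when the build is cancelled immediately after node removal. */
final class RemovedNodeCause extends CauseOfInterruption implements AgentRemovalCause {
    @Override String getShortDescription() { return "Agent was removed"; } // illustrative text
}

/** Used when the build is cancelled only after the timeout has elapsed. */
final class RemovedNodeTimeoutCause extends CauseOfInterruption implements AgentRemovalCause {
    @Override String getShortDescription() { return "Agent did not come back in time"; } // illustrative text
}

/** An unrelated cause (hypothetical), to show what must NOT match. */
final class UserInterruption extends CauseOfInterruption {
    @Override String getShortDescription() { return "Aborted by user"; }
}

/** Hypothetical stand-in for the matching done in AgentErrorCondition. */
final class AgentErrorConditionSketch {
    /** True if any cause carries the marker; no per-class checks are needed. */
    static boolean matches(List<? extends CauseOfInterruption> causes) {
        return causes.stream().anyMatch(c -> c instanceof AgentRemovalCause);
    }
}
```

With a marker interface, adding a third removal-related cause later would not require touching the matching code, which is presumably the simplification the commit message refers to.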

jglick

Simpler now, thanks. Some optional suggestions.

jglick · 2024-05-15T13:23:59Z

+            if (isOneShotAgent(node)) {
+                LOGGER.fine(() -> "Cancelling owner run for one-shot agent " + node.getNodeName() + " immediately");
+                cancelOwnerExecution(node, new RemovedNodeCause());


(The crucial part FTR.)

jglick · 2024-05-15T14:12:11Z

ShellStepTest.removingAgentIsFatal needs a change to expected message.

jglick · 2024-10-10T17:12:59Z

+            return node instanceof AbstractCloudSlave ||
+                    (node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);


Unfortunately this heuristic matches neither EC2AbstractSlave extends Slave nor EC2RetentionStrategy extends RetentionStrategy. I guess we need to hard-code support for those nonstandard implementations. CC @car-roll
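
To make the mismatch concrete, the sketch below reproduces the heuristic from the diff against local stand-in classes. None of these are the real Jenkins or EC2 plugin types: the class names mirror the ones mentioned in this thread, but they are made concrete with no behaviour so the example runs on its own, and `OneShotHeuristicSketch` is a hypothetical wrapper.

```java
/** Stand-ins only; NOT the real Jenkins / ec2-plugin classes. */
class RetentionStrategy {}
class OnceRetentionStrategy extends RetentionStrategy {}
class EC2RetentionStrategy extends RetentionStrategy {}

class Node {}

class Slave extends Node {
    private final RetentionStrategy retentionStrategy;
    Slave(RetentionStrategy retentionStrategy) { this.retentionStrategy = retentionStrategy; }
    RetentionStrategy getRetentionStrategy() { return retentionStrategy; }
}

class AbstractCloudSlave extends Slave {
    AbstractCloudSlave() { super(new RetentionStrategy()); }
}

/** Extends Slave directly, as described in the comment above. */
class EC2AbstractSlave extends Slave {
    EC2AbstractSlave() { super(new EC2RetentionStrategy()); }
}

final class OneShotHeuristicSketch {
    /** Same shape as the check in the diff, applied to the stand-in types. */
    static boolean isOneShotAgent(Node node) {
        return node instanceof AbstractCloudSlave ||
                (node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);
    }
}
```

Under these assumptions, `isOneShotAgent(new EC2AbstractSlave())` is false: the node is not an `AbstractCloudSlave`, and its strategy is not a `OnceRetentionStrategy`, so neither branch applies.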

Can we start introducing a marker interface for this usage?

Maybe, though I think it would suffice for EC2AbstractSlave to extend AbstractCloudSlave and EC2Computer to extend AbstractCloudComputer, with some minor refactoring to delete then-redundant logic. CloudBees-internal reference

jenkinsci/ec2-plugin#998

Vlatombe added 2 commits May 14, 2024 14:51

Fail fast after cloud node removal

6014dde

Remove useless assertion

a68e629

The agent is never named `ghost`, rather `slave0`

Vlatombe changed the title ~~fail fast on cloud node removal~~ Fail fast on cloud node removal May 14, 2024

Vlatombe commented May 14, 2024


jglick added the bug label May 14, 2024

Suppress the restart, as the node deletion is processed immediately, …

17b1e37

…this is causing some havoc.

Vlatombe marked this pull request as ready for review May 14, 2024 15:01

Vlatombe requested a review from a team as a code owner May 14, 2024 15:01

Vlatombe closed this May 14, 2024

Vlatombe reopened this May 14, 2024

jglick reviewed May 14, 2024


Comment thread src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ExecutorPickle.java Outdated

Vlatombe added 5 commits May 15, 2024 09:24

Simplify execution

f23bf96

* Use only one cause
* When build is cancelled immediately, use RemovedNodeCause
* When build is cancelled after observing timeout, use RemovedNodeTimeoutCause
* Introduce a marker interface to simplify matching in AgentErrorCondition

Simplify test

2e02220

Review: adjust loging levels

118d49e

Add a test covering cloud agents

5753b5d

Remove active wait and remove unused tmp folder while I'm here

c4e7a10

Vlatombe requested a review from jglick May 15, 2024 08:52

Vlatombe added 2 commits May 15, 2024 10:53

Give it a longer timeout

04e044e

Fix invalid expectation

b62ed6f

jglick approved these changes May 15, 2024


Vlatombe added 2 commits May 15, 2024 15:45

Remove Retryable under AgentErrorCondition

5a8617e

Use RemovedNodeCause even for static agents removal.

797c108

jglick approved these changes May 15, 2024


Vlatombe added 2 commits May 15, 2024 16:13

Fix back expected message

c81e290

Reverting imports

04e4192

jglick approved these changes May 15, 2024


jglick merged commit 1891ab0 into jenkinsci:master May 15, 2024

Vlatombe deleted the fail-fast-on-cloud-node-removal branch May 15, 2024 15:08

jglick mentioned this pull request Jul 31, 2024

RemovedNodeListener.cancelOwnerExecution can be noisy #387

Merged

jglick reviewed Oct 10, 2024


car-roll mentioned this pull request Oct 28, 2024

support fail fast for node removal jenkinsci/ec2-plugin#998

Closed

7 tasks

jglick mentioned this pull request Mar 5, 2025

Spot VM evictions are not reported to Jenkins, so builds hang and status not reported jenkinsci/azure-vm-agents-plugin#323

Closed

