Skip to content

Fail fast on cloud node removal#372

Merged
jglick merged 14 commits into
jenkinsci:masterfrom
Vlatombe:fail-fast-on-cloud-node-removal
May 15, 2024
Merged

Fail fast on cloud node removal#372
jglick merged 14 commits into
jenkinsci:masterfrom
Vlatombe:fail-fast-on-cloud-node-removal

Conversation

@Vlatombe
Copy link
Copy Markdown
Member

@Vlatombe Vlatombe commented May 14, 2024

This cancels the node block immediately instead of waiting for the node to come in certain conditions when we are sure the node can't come back: using a cloud node, using OnceRetentionStrategy.

  • Fail fast after cloud node removal
  • Remove useless assertion

Testing done

### Submitter checklist
- [ ] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [ ] Ensure that the pull request title represents the desired changelog entry
- [ ] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [ ] Link to relevant pull requests, esp. upstream and downstream changes
- [ ] Ensure you have provided tests - that demonstrates feature works or fixes the issue

Vlatombe added 2 commits May 14, 2024 14:51
The agent is never named `ghost`, rather `slave0`
@Vlatombe Vlatombe changed the title fail fast on cloud node removal Fail fast on cloud node removal May 14, 2024
sessions.then(j -> {
// Start up a build and then reboot and take the node offline
assertEquals(0, j.jenkins.getLabel("ghost").getNodes().size()); // Make sure test impl is correctly deleted
assertNull(j.jenkins.getNode("ghost")); // Make sure test impl is correctly deleted
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless assertion, since the node would just have the label ghost but usually named slave0 based on generation.

});
}

@Test public void onceRetentionStrategyNodeDisappearance() throws Throwable {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially a copy of normalNodeDisappearance checking some behavioural differences.

@jglick jglick added the bug label May 14, 2024
@Vlatombe
Copy link
Copy Markdown
Member Author

Looks like some race condition, the Run with InterruptedBuildAction gets saved to disk properly, but somehow it's not visible in the Run object in the next session.

@Vlatombe Vlatombe marked this pull request as ready for review May 14, 2024 15:01
@Vlatombe Vlatombe requested a review from a team as a code owner May 14, 2024 15:01
@Vlatombe
Copy link
Copy Markdown
Member Author

Re-launching CI

@Vlatombe Vlatombe closed this May 14, 2024
@Vlatombe Vlatombe reopened this May 14, 2024
Copy link
Copy Markdown
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by the doubled-up cause of interruption.

s.setRetentionStrategy(new OnceRetentionStrategy(0));
var run = p.scheduleBuild2(0).waitForStart();
j.waitForMessage("+ sleep infinity", run);
j.jenkins.removeNode(s);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is what Reaper in kubernetes would do as soon as an agent pod is deleted.

Copy link
Copy Markdown
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I do not understand why there is a new cause of interruption at all. The only change that should need to be made is for RemovedNodeListener to interrupt the build immediately rather than after a delay, right?

Comment thread src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ExecutorPickle.java Outdated
Vlatombe added 5 commits May 15, 2024 09:24
* Use only one cause
* When build is cancelled immediately, use RemovedNodeCause
* When build is cancelled after observing timeout, use RemovedNodeTimeoutCause
* Introduce a marker interface to simplify matching in AgentErrorCondition
@Vlatombe Vlatombe requested a review from jglick May 15, 2024 08:52
Copy link
Copy Markdown
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler now, thanks. Some optional suggestions.

Comment on lines +357 to +359
if (isOneShotAgent(node)) {
LOGGER.fine(() -> "Cancelling owner run for one-shot agent " + node.getNodeName() + " immediately");
cancelOwnerExecution(node, new RemovedNodeCause());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The crucial part FTR.)

@jglick
Copy link
Copy Markdown
Member

jglick commented May 15, 2024

ShellStepTest.removingAgentIsFatal needs a change to expected message.

@jglick jglick merged commit 1891ab0 into jenkinsci:master May 15, 2024
@Vlatombe Vlatombe deleted the fail-fast-on-cloud-node-removal branch May 15, 2024 15:08
Comment on lines +362 to +363
return node instanceof AbstractCloudSlave ||
(node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this heuristic does not match EC2AbstractSlave extends Slave nor EC2RetentionStrategy extends RetentionStrategy. I guess we need to hard-code support for those nonstandard implementations. CC @car-roll

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start introducing a marker interface for this usage?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, though I think it would suffice for EC2AbstractSlave to extend AbstractCloudSlave and EC2Computer to extend AbstractCloudComputer, with some minor refactoring to delete then-redundant logic. CloudBees-internal reference

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants