Skip to content

Discard idle instances during shutdown#1125

Merged
fcojfernandez merged 3 commits intojenkinsci:masterfrom
jglick:shutdown
Aug 26, 2025
Merged

Discard idle instances during shutdown#1125
fcojfernandez merged 3 commits intojenkinsci:masterfrom
jglick:shutdown

Conversation

@jglick
Copy link
Copy Markdown
Member

@jglick jglick commented Jul 31, 2025

Extracted from #1121.

While testing the spare instances feature I noticed that while the controller launches a pool of spare instances when it starts, it does not terminate them when it shuts down. If the controller is immediately restarted, this is not a big deal since they will be connected again, but it can leave a mess in a CloudBees CI HA controller where each replica has its own set of spare instances. #1115 will clean them up but it could take hours. With this patch, the spares are neatly cleaned up.

I needed to introduce a Future into the internal API to make sure that the termination request is processed before the web app is unloaded and class loading from plugins becomes impossible; otherwise the termination is unreliable and mostly does not work at all. This does clash a bit with #1015.

@mwebber
Copy link
Copy Markdown

mwebber commented Aug 1, 2025

@jglick, am I correct in understanding that, with this PR, if I update my plugins and restart Jenkins, all my EC2 Agents will be terminated?
See this JENKINS-75746 comment and other comments about cases where people like to keep Agent instances.
Is it possible to make this opt-in only?

@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 1, 2025

all my EC2 Agents will be terminated?

Not exactly. Only idle agents; and only up to the number of configured spare instances, if that option is even used. So at most as many would have been launched when the controller starts.

If you think even that is going too far, that some people may be configuring spare instances but prefer for the unused spares to be left running when the controller terminates, then yes of course it could be made into (sigh) yet another checkbox in the agent template.

@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 1, 2025

(BTW I have not yet looked into the options for covering this and my other PRs via automated tests. I do not trust the mocks much, but I guess this plugin is set up neither for LocalStack container testing nor for live testing against AWS with a bring-your-own credentials model.)

@mwebber
Copy link
Copy Markdown

mwebber commented Aug 1, 2025

@jglick Thanks for the response. I agree that we seem to have too many checkboxes for configuration in this area, and their meaning, and the relationship between them, is somewhat opaque. I guess that's the result of the rules for starting / stopping / terminating instances having evolved over time.

some people may ... prefer for the unused spares to be left running when the controller terminates

Typically when we restart the controller, most if not all Agent EC2 instances will be in a Stopped state (i.e. not running, but not terminated). What we would like is for those instances to remain as they are, and be associated with and available for use by the (restarted) Controller when required for a job.

I guess this plugin is set up neither for LocalStack container testing nor for live testing against AWS with a bring-your-own credentials model.

I can provide some assistance with basic testing in our AWS environment, if that helps. I can drop a plugin test release into a Jenkins that's idling, run a few jobs, restart controller, watch instances start / stop / terminate, etc.

@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 1, 2025

instances will be in a Stopped state (i.e. not running, but not terminated). What we would like is for those instances to remain as they are

I am definitely not planning to implement that, but if that is something that currently works, then it sounds like the behavior in this PR must not be enabled for users with this requirement. So, I will make it an option.

assistance with basic testing in our AWS environment

Sorry, I was unclear—I have already tested this in an EKS cluster with IAM-based access to EC2. Of course additional sanity checks by others are welcome. I was just uncertain what automated test coverage is expected in general for changes in this plugin.

@jglick jglick marked this pull request as draft August 1, 2025 18:39
jglick added a commit to jglick/ec2-plugin that referenced this pull request Aug 1, 2025
@jglick jglick marked this pull request as ready for review August 1, 2025 19:58
@jglick jglick marked this pull request as draft August 2, 2025 12:02
@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 2, 2025

I am starting to think it would be more effective for the option to just terminate all idle agents, rather than being tied to the spare count, due to a longstanding bug in the provisioning logic under load (a topic for another PR). I will experiment with this alternative approach.

@jglick jglick changed the title Discard spare instances during shutdown Discard idle instances during shutdown Aug 4, 2025
@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 4, 2025

Indeed it does seem more helpful to just (at an option) terminate & remove any idle agents during shutdown, regardless of why they were launched to begin with. This behaves better when the cloud accidentally overprovisions agents in response to a rapid enqueueing of builds: if some of the agents are left unused then they are simply removed the next time you restart.

@jglick jglick marked this pull request as ready for review August 4, 2025 15:01
@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 14, 2025

@mwebber I think the revised logic should address your concern?

@mwebber
Copy link
Copy Markdown

mwebber commented Aug 14, 2025

@mwebber I think the revised logic should address your concern?

@jglick That all looks ok, thanks.

@jglick
Copy link
Copy Markdown
Member Author

jglick commented Aug 14, 2025

Thanks. @res0nance / @fcojfernandez?

@fcojfernandez fcojfernandez added the enhancement Feature additions or enhancements label Aug 26, 2025
* Terminates the instance in EC2.
*/
public abstract void terminate();
public abstract Future<?> terminate();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked potential external impacts and it seems that only https://github.com/jenkinsci/ec2-cloud-axis-plugin/blob/4d1a8263cfe04320b311fe26ae613d1805073aa2/src/main/java/hudson/plugins/ec2/Utils.java#L14 would be using this method. Seems safe

@fcojfernandez fcojfernandez merged commit c13c882 into jenkinsci:master Aug 26, 2025
16 checks passed
@jglick jglick deleted the shutdown branch August 26, 2025 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Feature additions or enhancements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants